An integrated semi-automated framework for domain-based polarity words extraction from an unannotated non-English corpus
Document Type
Article
Publication Date
12-1-2020
Abstract
Building sentiment analysis resources is a fundamental step before developing any sentiment analysis model, and sentiment lexicons are among the most critical of these resources. However, many non-English languages suffer from a severe shortage of such resources. This study proposes an integrated framework for extracting domain-based polarity words from a massive unannotated non-English corpus. The framework consists of three layers: lexicon-based, corpus-based and human-based. The first two layers automatically recognize and extract new polarity words from the corpus using initial seed lexicons. A key advantage of the proposed framework is that it needs only an initial seed lexicon and an unannotated corpus to start the extraction process; because it depends on seed lexicons, the framework is semi-automated rather than fully automated. Experiments on three languages indicate that the proposed framework outperformed existing lexicons, achieving F-scores of 77.8%, 83.8% and 68.6% for the Arabic, French and Malay lexicons, respectively.
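The abstract does not spell out how the corpus-based layer scores candidate words, so the following is only a minimal sketch of one common seed-driven approach (sentence-level co-occurrence with seed words), not the authors' method. The function name `expand_lexicon`, the thresholds `min_count` and `margin`, and the toy French sentences are all hypothetical and chosen purely for illustration.

```python
from collections import Counter


def expand_lexicon(corpus, pos_seeds, neg_seeds, min_count=2, margin=3.0):
    """Hypothetical seed-based expansion (an assumption, not the paper's algorithm).

    A non-seed word becomes a positive candidate when it co-occurs with
    positive seeds at least `min_count` times and at least `margin` times
    as often as with negative seeds (and symmetrically for negative).
    """
    pos_hits, neg_hits = Counter(), Counter()
    for sentence in corpus:
        tokens = set(sentence.lower().split())
        has_pos = bool(tokens & pos_seeds)
        has_neg = bool(tokens & neg_seeds)
        # Count each non-seed word once per sentence that contains a seed.
        for tok in tokens - pos_seeds - neg_seeds:
            if has_pos:
                pos_hits[tok] += 1
            if has_neg:
                neg_hits[tok] += 1

    new_pos, new_neg = set(), set()
    for tok in set(pos_hits) | set(neg_hits):
        p, n = pos_hits[tok], neg_hits[tok]
        if p >= min_count and p >= margin * n:
            new_pos.add(tok)
        elif n >= min_count and n >= margin * p:
            new_neg.add(tok)
    return new_pos, new_neg


# Toy unannotated corpus (invented for this sketch).
corpus = [
    "service excellent et rapide",
    "repas excellent et savoureux",
    "accueil bon et rapide",
    "plat bon et savoureux",
    "service mauvais et lent",
    "repas mauvais et lent",
    "accueil mauvais et froid",
    "plat mauvais et froid",
]
pos_seeds = {"excellent", "bon"}
neg_seeds = {"mauvais"}

new_pos, new_neg = expand_lexicon(corpus, pos_seeds, neg_seeds)
print(sorted(new_pos), sorted(new_neg))
```

In this sketch, words that appear equally often with both polarities (such as the conjunction "et") or too rarely with either are left unlabeled. On a real corpus such heuristics still produce false candidates, which is one plausible reason to follow automatic extraction with a human review step.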
Keywords
Multilingual sentiment analysis, Sentiment lexicon, Polarity words, Social media analysis, Unannotated corpus
Divisions
fsktm
Publication Title
Journal of Supercomputing
Volume
76
Issue
12
Publisher
Springer
Publisher Location
VAN GODEWIJCKSTRAAT 30, 3311 GZ DORDRECHT, NETHERLANDS