An integrated semi-automated framework for domain-based polarity words extraction from an unannotated non-English corpus

Document Type

Article

Publication Date

12-1-2020

Abstract

Building sentiment analysis resources is a fundamental step before developing any sentiment analysis model, and sentiment lexicons are among the most critical of these resources. However, many non-English languages suffer from a severe shortage of such resources. This study proposes an integrated framework for extracting domain-based polarity words from a massive unannotated non-English corpus. The framework consists of three layers: lexicon-based, corpus-based and human-based. The first two layers automatically recognize and extract new polarity words from the corpus using initial seed lexicons. A key advantage of the proposed framework is that it needs only an initial seed lexicon and an unannotated corpus to start the extraction process. The framework is described as semi-automated because it relies on seed lexicons. Experiments on three languages indicate that the proposed framework outperformed existing lexicons, achieving F-scores of 77.8%, 83.8% and 68.6% for the Arabic, French and Malay lexicons, respectively.

Keywords

Multilingual sentiment analysis, Sentiment lexicon, Polarity words, Social media analysis, Unannotated corpus

Divisions

fsktm

Publication Title

Journal of Supercomputing

Volume

76

Issue

12

Publisher

Springer

Publisher Location

VAN GODEWIJCKSTRAAT 30, 3311 GZ DORDRECHT, NETHERLANDS

This document is currently not available here.
