Wikipedia-Based Relatedness Measurements for Multilingual Short Text Clustering
- Tatsuya Nakamura (Osaka University, Osaka, Japan)
- Masumi Shirakawa (Hapicom Inc., Japan and Osaka University, Osaka, Japan)
- Takahiro Hara (Osaka University, Osaka, Japan)
- Shojiro Nishio (Osaka University, Osaka, Japan)
Bibliographic Information
- Publication date: 2018-12-14
- Resource type: journal article
- Rights information: https://www.acm.org/publications/policies/copyright_policy#Background
- DOI: 10.1145/3276473
- Publisher: Association for Computing Machinery (ACM)
Description
Throughout the world, people use social networking services to post information about their local area in their own languages. Multilingual short text clustering is an important task for organizing such information, with applications such as event detection and summarization. However, measuring the relatedness between short texts written in different languages is challenging: in addition to handling multiple languages, the semantic gaps among the languages must be considered. In this article, we propose two Wikipedia-based semantic relatedness measurement methods for multilingual short text clustering. The proposed methods address the semantic gap problem by incorporating the inter-language links of Wikipedia into Extended Naive Bayes (ENB), a probabilistic method for measuring semantic relatedness among monolingual short texts. The proposed methods represent a multilingual short text as a vector of English Wikipedia articles (entities). Transferring texts to this unified vector space increases the relatedness between texts that are written in different languages but have similar meanings. We also propose an approach that improves clustering performance and reduces processing time by eliminating language-specific entities from the unified vector space. Experimental results on multilingual Twitter message clustering show that the proposed methods outperformed cross-lingual explicit semantic analysis, a previously proposed method for measuring relatedness between texts in different languages. Moreover, the proposed methods were comparable to ENB applied to texts translated into English using a proprietary translation service. The proposed methods thus enable relatedness measurement for multilingual short text clustering without requiring machine translation.
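The unified vector space described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the link table `INTERLANGUAGE_LINKS`, the entity weights, and the function names are hypothetical, and the weights simply stand in for whatever scores ENB would assign. Two further assumptions are made for illustration only: entities without an inter-language link to English are treated as language-specific and dropped, and cosine similarity is used as the relatedness score, which the abstract does not specify.

```python
from math import sqrt

# Hypothetical inter-language links:
# (language, local article title) -> English article title.
INTERLANGUAGE_LINKS = {
    ("ja", "地震"): "Earthquake",
    ("ja", "大阪市"): "Osaka",
    ("es", "Terremoto"): "Earthquake",
    ("es", "Osaka"): "Osaka",
}


def to_english_space(lang, entity_weights):
    """Map per-language entity weights onto English Wikipedia articles.

    Assumption: an entity with no inter-language link to English is
    treated as language-specific and is dropped.
    """
    vector = {}
    for title, weight in entity_weights.items():
        if lang == "en":
            english = title
        else:
            english = INTERLANGUAGE_LINKS.get((lang, title))
        if english is None:
            continue  # language-specific entity: eliminated
        vector[english] = vector.get(english, 0.0) + weight
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy entity weights for one Japanese and one Spanish short text.
ja_text = to_english_space("ja", {"地震": 0.7, "大阪市": 0.2, "関西弁": 0.1})
es_text = to_english_space("es", {"Terremoto": 0.8, "Osaka": 0.2})
relatedness = cosine(ja_text, es_text)
```

In this toy case both texts end up as vectors over the same English entities, so they can be compared directly even though they share no surface words, and the Japanese entity with no English counterpart does not contribute to the score.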
Journal
- ACM Transactions on Asian and Low-Resource Language Information Processing 18 (2), 1-25, 2018-12-14
- Association for Computing Machinery (ACM)
Details
- CRID: 1360004236284629376
- DOI: 10.1145/3276473
- ISSN: 23754702, 23754699
- Material type: journal article
- Data source type: Crossref, KAKEN, OpenAIRE
