Wikipedia-Based Relatedness Measurements for Multilingual Short Text Clustering

Tatsuya Nakamura, Masumi Shirakawa, Takahiro Hara, Shojiro Nishio

doi:10.1145/3276473

Throughout the world, people can post information about their local area in their own languages using social networking services. Multilingual short text clustering is an important task to organize such information, and it can be applied to various applications, such as event detection and summarization. However, measuring the relatedness between short texts written in various languages is a challenging problem. In addition to handling multiple languages, the semantic gaps among all languages must be considered. In this article, we propose two Wikipedia-based semantic relatedness measurement methods for multilingual short text clustering. The proposed methods solve the semantic gap problem by incorporating the inter-language links of Wikipedia into Extended Naive Bayes (ENB), a probabilistic method that can be applied to measure semantic relatedness among monolingual short texts. The proposed methods represent a multilingual short text as a vector of the English version of Wikipedia articles (entities). By transferring texts to a unified vector space, the relatedness between texts in different languages with similar meanings can be increased. We also propose an approach that can improve clustering performance and reduce the processing time by eliminating language-specific entities in the unified vector space. Experimental results on multilingual Twitter message clustering revealed that the proposed methods outperformed cross-lingual explicit semantic analysis, a previously proposed method to measure relatedness between texts in different languages. Moreover, the proposed methods were comparable to ENB applied to texts translated into English using a proprietary translation service. The proposed methods enabled relatedness measurements for multilingual short text clustering without requiring machine translation processes.
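The idea of transferring texts to a unified English entity space can be illustrated with a minimal sketch. It is not the authors' ENB-based method: the entity weights, the `INTERLANGUAGE_LINKS` table, and the function names are hypothetical, and cosine similarity is an assumption used here only as a stand-in relatedness score. The sketch shows how entities from different languages collapse onto shared English articles and how entities without an English counterpart (language-specific entities) are dropped.

```python
import math

# Hypothetical inter-language link table:
# (language, local article title) -> English article title.
# Articles with no English counterpart are simply absent.
INTERLANGUAGE_LINKS = {
    ("ja", "地震"): "Earthquake",
    ("ja", "東京"): "Tokyo",
    ("es", "Terremoto"): "Earthquake",
    ("es", "Tokio"): "Tokyo",
}


def to_english_space(lang, entity_weights):
    """Map a {local entity: weight} vector onto English Wikipedia entities.

    Entities without an inter-language link to English are treated as
    language-specific and dropped from the unified vector.
    """
    unified = {}
    for entity, weight in entity_weights.items():
        english = INTERLANGUAGE_LINKS.get((lang, entity))
        if english is None:
            continue  # language-specific entity: eliminate it
        unified[english] = unified.get(english, 0.0) + weight
    return unified


def cosine(u, v):
    """Cosine similarity of two sparse vectors (assumed relatedness score)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up entity weights for one Japanese and one Spanish short text.
ja_text = to_english_space("ja", {"地震": 0.7, "東京": 0.2, "ご当地キャラ": 0.1})
es_text = to_english_space("es", {"Terremoto": 0.6, "Tokio": 0.4})
relatedness = cosine(ja_text, es_text)
```

In this toy setting the two texts share every remaining dimension after mapping, so their score is high even though they have no words in common. In the paper, the entity weights would come from the ENB-based model rather than being set by hand.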
