Chinese Word Segmentation and Unknown Word Extraction by Mining Maximized Substring

Shen Mo, Kawahara Daisuke, Kurohashi Sadao

doi:10.5715/jnlp.23.235

Abstract

Chinese word segmentation is an initial and important step in Chinese language processing. Recent advances in machine learning techniques have boosted the performance of Chinese word segmentation systems, yet the identification of out-of-vocabulary words is still a major problem in this field of study. Recent research has attempted to address this problem by exploiting characteristics of frequent substrings in unlabeled data. We propose a simple yet effective approach for extracting a specific type of frequent substring, called a maximized substring, which provides good estimates of unknown word boundaries. In the task of Chinese word segmentation, we use these substrings, extracted from large-scale unlabeled data, to improve segmentation accuracy. The effectiveness of this approach is demonstrated through experiments using various data sets from different domains. In the task of unknown word extraction, we apply post-processing techniques that effectively reduce the noise in the extracted substrings. We demonstrate the effectiveness and efficiency of our approach by comparing the results with a widely applied Chinese word recognition method from a previous study.

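Journal

Journal of Natural Language Processing (自然言語処理)

Journal of Natural Language Processing 23 (3), 235-266, 2016

The Association for Natural Language Processing (一般社団法人 言語処理学会)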

Details

CRID: 1390282679453256960

NII Article ID: 130005411025

NII Bibliographic ID: AN10472659

DOI: 10.5715/jnlp.23.235

ISSN: 2185-8314; 1340-7619

NDL Bibliographic ID: 027492546

Web Site: https://ndlsearch.ndl.go.jp/books/R000000004-I027492546; https://www.jstage.jst.go.jp/article/jnlp/23/3/23_235/_pdf

Text language code: en

Data source type

JaLC
NDL
Crossref
CiNii Articles

Abstract license flag: Not available for use

Cited by (2)

References (27)