Chinese Word Segmentation and Unknown Word Extraction by Mining Maximized Substring
- Shen Mo (Graduate School of Informatics, Kyoto University)
- Kawahara Daisuke (Graduate School of Informatics, Kyoto University)
- Kurohashi Sadao (Graduate School of Informatics, Kyoto University)
Description
Chinese word segmentation is an initial and important step in Chinese language processing. Recent advances in machine learning techniques have boosted the performance of Chinese word segmentation systems, yet the identification of out-of-vocabulary words remains a major problem in this field. Recent research has attempted to address this problem by exploiting characteristics of frequent substrings in unlabeled data. We propose a simple yet effective approach for extracting a specific type of frequent substring, called maximized substrings, which provide good estimates of unknown word boundaries. In the task of Chinese word segmentation, we use these substrings, extracted from large-scale unlabeled data, to improve segmentation accuracy. The effectiveness of this approach is demonstrated through experiments using various data sets from different domains. In the task of unknown word extraction, we apply post-processing techniques that effectively reduce the noise in the extracted substrings. We demonstrate the effectiveness and efficiency of our approach by comparing the results with a widely applied Chinese word recognition method from a previous study.
Journal
- Information and Media Technologies
Information and Media Technologies 11 (0), 181-212, 2016
Information and Media Technologies Editorial Board
Details
- CRID: 1390282680241359104
- NII Article ID: 130005250472
- ISSN: 18810896
- Text Lang: en
- Data Source: JaLC, CiNii Articles
- Abstract License Flag: Disallowed