Character-Based Thai Word Segmentation with Multiple Attentions
- Chay-intr Thodsaporn, School of Engineering, Tokyo Institute of Technology
- Kamigaito Hidetaka, Division of Information Science, Nara Institute of Science and Technology
- Okumura Manabu, Institute of Innovative Research, Tokyo Institute of Technology
Description
Character-based word segmentation models have been extensively applied to Asian languages, including Thai, owing to their promising performance. These models estimate the word boundaries from a character sequence; however, a Thai character unit in a sequence has no inherent meaning, in contrast with word, subword, and character cluster units that represent more meaningful linguistic information. In this paper, we propose a Thai word segmentation model that uses various types of information, including words, subwords, and character clusters, from a character sequence. Our model applies multiple attentions to refine segmentation inferences by estimating the significant relationships among characters and various unit types. We evaluated our model on three Thai datasets, and the experimental results show that our model outperforms other Thai word segmentation models, demonstrating the validity of using character clusters over subword units. A case study on sample Thai text supported these results. Thus, according to our analysis, particularly the case study, our model can segment Thai text accurately, while other existing models yield incorrect results that violate the Thai writing system.
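As a minimal illustration of what "estimating word boundaries from a character sequence" can mean, the sketch below converts between a list of words and one label per character. The binary labelling (1 for a character that starts a word, 0 otherwise) and the function names are assumptions made for this example; the abstract does not state which tagging scheme the paper uses, and the sketch does not include any attention model.

```python
def words_to_labels(words):
    """Encode a segmentation as one label per character.

    Assumed scheme for illustration: 1 marks the first character
    of a word, 0 marks every other character.
    """
    labels = []
    for word in words:
        if not word:
            continue  # ignore empty strings so no stray label is produced
        labels.extend([1] + [0] * (len(word) - 1))
    return labels


def labels_to_words(chars, labels):
    """Decode per-character labels back into a list of words."""
    if len(chars) != len(labels):
        raise ValueError("chars and labels must have the same length")
    words = []
    for ch, is_start in zip(chars, labels):
        if is_start or not words:
            words.append(ch)   # open a new word
        else:
            words[-1] += ch    # extend the current word
    return words


# The same five characters admit two readings, depending on the labels.
text = "ตากลม"
reading_a = labels_to_words(text, [1, 0, 1, 0, 0])  # two words: 2 + 3 characters
reading_b = labels_to_words(text, [1, 0, 0, 1, 0])  # two words: 3 + 2 characters
```

A character-based segmenter, under this framing, is any model that predicts the label list from the characters; the decoding step stays the same whichever model produces the labels.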
Journal
- Journal of Natural Language Processing
Journal of Natural Language Processing 30 (2), 372-400, 2023
The Association for Natural Language Processing
Details
- CRID: 1390014967590302080
- ISSN: 2185-8314, 1340-7619
- Text Lang: en
- Data Source: JaLC, Crossref, OpenAIRE
- Abstract License Flag: Disallowed