A Statistical Approach to Domain Independent Text Segmentation
-
- UTIYAMA MASAO
- Communications Research Laboratory
-
- ISAHARA HITOSHI
- Communications Research Laboratory
Bibliographic Information
- Other Title
-
- 統計的手法による分野非依存のテキスト分割
- トウケイテキ シュホウ ニ ヨル ブンヤ ヒイゾン ノ テキスト ブンカツ
Abstract
A text is usually composed of multiple topics. Segmenting such a text into coherent topics is useful both for information retrieval and for automatic text summarization. This paper proposes a statistical method that selects, from among the possible segmentations of a given text, the one with the highest probability as the best segmentation. Since the method estimates the probabilities of segmentations from the given text itself, it does not need training data. Therefore, it can be applied to any text in any domain. The effectiveness of the method was confirmed through two experiments. The first experiment evaluated the accuracy of the method by using publicly available data. The experimental results showed that the accuracy of the proposed method is at least as good as that of a state-of-the-art text segmentation system. The second experiment compared the segmentations produced by our method with the original segments in relatively long documents. When we compared our system's segmentations with chapters in the documents, the accuracy was 0.37 when only exact matches were regarded as correct. When differences of ±1 line were also regarded as correct, the accuracy was 0.49. When we compared our system's segmentations with sections, the accuracies were 0.34 and 0.51, respectively. These results show that our method is effective for domain independent text segmentation.
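The idea of choosing the most probable segmentation using only statistics from the text itself can be illustrated with a minimal sketch. This is not the authors' implementation: the Laplace-smoothed unigram model per segment, the per-segment penalty of `log(total_words)`, and the names `segment_cost` and `best_segmentation` are all assumptions made for illustration. Input is assumed to be a list of tokenized sentences, and candidate boundaries are assumed to fall between sentences.

```python
import math
from collections import Counter


def segment_cost(sentences, vocab_size, total_words):
    """Negative log-probability of one candidate segment (assumed model)."""
    counts = Counter(w for s in sentences for w in s)
    n = sum(counts.values())
    # Laplace-smoothed unigram likelihood, estimated from the segment itself,
    # so no training data is involved.
    cost = sum(c * math.log((n + vocab_size) / (c + 1)) for c in counts.values())
    # Assumed prior: every extra segment pays a fixed penalty, which
    # discourages over-segmentation.
    return cost + math.log(total_words)


def best_segmentation(sentences):
    """Return the exclusive end index of each segment in the lowest-cost split."""
    vocab_size = len({w for s in sentences for w in s})
    total_words = sum(len(s) for s in sentences)
    n = len(sentences)
    best = [0.0] + [math.inf] * n   # best[i]: lowest cost of segmenting sentences[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last segment
    for end in range(1, n + 1):
        for start in range(end):
            cost = best[start] + segment_cost(
                sentences[start:end], vocab_size, total_words
            )
            if cost < best[end]:
                best[end], back[end] = cost, start
    # Walk the back-pointers to recover the segment boundaries.
    bounds, i = [], n
    while i > 0:
        bounds.append(i)
        i = back[i]
    return bounds[::-1]
```

Minimizing the summed negative log-probability is equivalent to maximizing the probability of the segmentation, and the dynamic program finds the optimum over all boundary placements without enumerating them. The cost of each candidate segment is recomputed from scratch here for clarity; a practical version would cache word counts.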
Journal
-
- Journal of Natural Language Processing
-
Journal of Natural Language Processing 8 (4), 19-36, 2001
The Association for Natural Language Processing
Details
-
- CRID
- 1390001204475892736
-
- NII Article ID
- 10021991573
-
- NII Book ID
- AN10472659
-
- ISSN
- 21858314
- 13407619
-
- NDL BIB ID
- 5941296
-
- Text Lang
- ja
-
- Data Source
-
- JaLC
- NDL
- Crossref
- CiNii Articles
-
- Abstract License Flag
- Disallowed