Domain-Independent Text Segmentation by a Statistical Method

内山 将夫, 井佐原 均

doi:10.5715/jnlp.8.4_19

A text usually covers multiple topics. Segmenting such a text into topically coherent segments is useful for both information retrieval and automatic text summarization. This paper proposes a statistical method that selects, among all possible segmentations of a given text, the one with the highest probability. Because the method estimates segmentation probabilities from the given text itself, it needs no training data and can therefore be applied to any text in any domain. The effectiveness of the method was confirmed through two experiments. The first experiment evaluated the accuracy of the method on publicly available data; the results showed that the proposed method is at least as accurate as a state-of-the-art text segmentation system. The second experiment compared the segmentations produced by our method with the original segment boundaries of relatively long documents. When we compared our system's segmentations with chapter boundaries, the accuracy was 0.37 when only exact matches were counted as correct and 0.49 when differences of ±1 line were also counted as correct. When compared with section boundaries, the accuracies were 0.34 and 0.51, respectively. These results show that our method is effective for domain-independent text segmentation.
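The idea of choosing the most probable segmentation, using probabilities estimated only from the input text, can be illustrated with a small dynamic-programming sketch. The abstract does not specify the probability model, so everything concrete here is an assumption made for illustration: the Laplace-smoothed unigram model per segment, the per-segment penalty of log(total words), and the function names `segment_cost`, `best_segmentation`, and `boundary_hit_rate`. The last helper shows one possible reading of the exact-match and ±1-line comparison; the abstract does not define its accuracy measure.

```python
import math
from collections import Counter


def segment_cost(counts, length, vocab_size):
    # Negative log-probability of a segment's words under a Laplace-smoothed
    # unigram model estimated from that same segment (assumed model).
    return -sum(c * math.log((c + 1) / (length + vocab_size))
                for c in counts.values())


def best_segmentation(sentences, penalty=None):
    """Return the boundary positions of the minimum-cost segmentation.

    sentences: list of token lists. A boundary b means a new segment
    starts at sentences[b]. Minimum cost corresponds to maximum probability.
    """
    n = len(sentences)
    vocab_size = len({w for s in sentences for w in s})
    total_words = sum(len(s) for s in sentences)
    if penalty is None:
        # Assumed prior: each segment costs log(total words).
        penalty = math.log(max(total_words, 2))

    best = [0.0] + [math.inf] * n   # best[j]: min cost of sentences[:j]
    back = [0] * (n + 1)            # back[j]: start of the last segment
    for i in range(n):
        counts, length = Counter(), 0
        for j in range(i + 1, n + 1):
            counts.update(sentences[j - 1])
            length += len(sentences[j - 1])
            cost = best[i] + segment_cost(counts, length, vocab_size) + penalty
            if cost < best[j]:
                best[j], back[j] = cost, i

    boundaries, j = [], n
    while j > 0:                    # walk back through the chosen segments
        j = back[j]
        if j > 0:
            boundaries.append(j)
    return sorted(boundaries)


def boundary_hit_rate(predicted, reference, tolerance=0):
    # Fraction of reference boundaries with a predicted boundary within
    # `tolerance` lines (an assumed definition, for illustration only).
    if not reference:
        return 0.0
    hits = sum(any(abs(p - r) <= tolerance for p in predicted)
               for r in reference)
    return hits / len(reference)


# Toy document: six sentences on one topic, then six on another.
doc = [["cat", "dog", "cat", "dog"]] * 6 + [["stock", "bond", "stock", "bond"]] * 6
found = best_segmentation(doc)
```

No training corpus appears anywhere in the sketch: the word statistics come from the candidate segments of the input itself, which is the property that makes such a method applicable across domains.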
