統計的手法による分野非依存のテキスト分割

Bibliographic Information

Alternative Titles
  • A Statistical Approach to Domain Independent Text Segmentation
  • トウケイテキ シュホウ ニ ヨル ブンヤ ヒイゾン ノ テキスト ブンカツ

Abstract

A text is usually composed of multiple topics. Segmenting such a text into topically coherent parts is useful both for information retrieval and for automatic text summarization. This paper proposes a statistical method that selects, from among the possible segmentations of a given text, the one with the highest probability. Because the method estimates the probabilities of segmentations from the given text itself, it does not need training data and can therefore be applied to any text in any domain. The effectiveness of the method was confirmed through two experiments. The first experiment evaluated the accuracy of the method on publicly available data. The results showed that the accuracy of the proposed method is at least as good as that of a state-of-the-art text segmentation system. The second experiment compared the segmentations produced by our method with the original segments of relatively long documents. When we compared our system's segmentations with the chapters of the documents, the accuracy was 0.37 when only exact matches were counted as correct and 0.49 when boundaries within ±1 line were also counted as correct. When we compared our system's segmentations with sections, the accuracies were 0.34 and 0.51, respectively. These results show that our method is effective for domain-independent text segmentation.
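The abstract does not give the probability model, so the following is only a minimal sketch of the general idea (choosing the most probable segmentation using statistics estimated from the text itself), not the paper's confirmed formulation. It assumes a Laplace-smoothed unigram model estimated separately within each candidate segment and a per-segment penalty of log(total word count) acting as a prior against over-segmentation. The names `segment_cost` and `segment` are hypothetical.

```python
import math
from collections import Counter


def segment_cost(words, vocab_size, total_len):
    """Negative log-probability of one segment (assumed model).

    Words are scored with a Laplace-smoothed unigram model estimated
    from the segment itself; log(total_len) is an assumed prior penalty
    paid once per segment.
    """
    counts = Counter(words)
    n = len(words)
    cost = 0.0
    for c in counts.values():
        cost -= c * math.log((c + 1) / (n + vocab_size))
    return cost + math.log(total_len)


def segment(sentences):
    """Return sentence-gap indices of the minimum-cost segmentation.

    `sentences` is a list of token lists. Dynamic programming over all
    (start, end) sentence spans finds the segmentation whose summed
    cost is lowest, i.e. whose probability is highest under the model.
    """
    vocab_size = len({w for s in sentences for w in s})
    total_len = sum(len(s) for s in sentences)
    n = len(sentences)
    best = [0.0] + [math.inf] * n   # best[j]: min cost of sentences[:j]
    back = [0] * (n + 1)            # back[j]: start of the last segment
    for j in range(1, n + 1):
        for i in range(j):
            words = [w for s in sentences[i:j] for w in s]
            cost = best[i] + segment_cost(words, vocab_size, total_len)
            if cost < best[j]:
                best[j], back[j] = cost, i
    # Walk the back-pointers to recover the interior boundaries.
    bounds = []
    j = n
    while j > 0:
        j = back[j]
        if j > 0:
            bounds.append(j)
    return sorted(bounds)


# Toy input: four sentences on one topic followed by four on another.
toy = [["cat"] * 3] * 4 + [["dog"] * 3] * 4
print(segment(toy))
```

No training data is involved: every quantity in the cost is computed from the input text, which is what allows this kind of approach to be applied to text from any domain.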

Published In

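  • Journal of Natural Language Processing (自然言語処理)

    Journal of Natural Language Processing 8 (4), 19-36, 2001

    The Association for Natural Language Processing (一般社団法人 言語処理学会)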

