Clustering OCR-ed texts for browsing document image database

Michihiko Minoh, Koji Tsuda, Katsuo Ikeda, S. Senda

doi:10.1109/icdar.1995.598969

Description

Document clustering is a powerful tool for browsing a document database. Similar documents are gathered into clusters, and a representative document of each cluster is shown to users. For users to infer the content of the database from a few representatives, the documents must be separated into tight clusters, in which documents are connected by high similarities. At the same time, clustering must be fast enough for user interaction. We propose an O(n^2) time, O(n) space cluster extraction method. It is faster than ordinary clustering methods, and its clusters compare favorably with those produced by Complete Link for tightness. When we deal with OCR-ed documents, term loss caused by recognition faults can change similarities between documents. We also examined the effect of recognition faults on the performance of document clustering.
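
The point about term loss can be illustrated with a small sketch. The abstract does not name the similarity measure used, so cosine similarity over term counts is an assumption here, and the helper names (cosine_similarity, drop_terms) and sample texts are hypothetical. The sketch only shows how a term dropped by a recognition fault lowers the similarity between two documents.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_terms(doc: Counter, lost: set) -> Counter:
    """Simulate OCR term loss by removing the given terms from a document."""
    return Counter({t: c for t, c in doc.items() if t not in lost})


# Two hypothetical documents, represented as bags of terms.
doc_a = Counter("document clustering for browsing image database".split())
doc_b = Counter("clustering documents in an image database".split())

sim_clean = cosine_similarity(doc_a, doc_b)

# Suppose the OCR step fails to recognize "clustering" in the first document.
doc_a_ocr = drop_terms(doc_a, {"clustering"})
sim_ocr = cosine_similarity(doc_a_ocr, doc_b)

print(f"similarity without term loss: {sim_clean:.3f}")
print(f"similarity with term loss:    {sim_ocr:.3f}")
```

Because clustering decisions depend on these pairwise similarities, a shift of this kind can move a document across a cluster boundary, which is the effect the authors report examining.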

Published in

Proceedings of 3rd International Conference on Document Analysis and Recognition

Proceedings of 3rd International Conference on Document Analysis and Recognition 1 171-174, 2002-11-19

IEEE Comput. Soc. Press

Details

CRID: 1872835442685058304

DOI: 10.1109/icdar.1995.598969

Data source type

OpenAIRE
