Text Classification using Similarity of Tree Sources Estimated from Bayes Coding Algorithm

IWAMA Hiroki, ISHIDA Takashi, GOTO Masayuki

doi:10.11221/jima.64.438

Bibliographic Information

Other Title

ベイズ符号化法によって推定される木情報源の類似度を用いた自動文書分類
ベイズフゴウカホウニヨッテスイテイサレルモクジョウホウゲンノルイジドオモチイタジドウブンショブンルイ

Abstract

In this paper, we propose a text classification method based on the Bayes coding algorithm, an efficient data compression method. The Bayes coding algorithm achieves Bayes optimal data compression over the class of tree source models. When data are compressed with the Bayes coding algorithm, the probability structure of the information source is implicitly estimated from the data being compressed. We can therefore expect this implicit estimate to be useful for purposes other than compression, in particular for document classification. For document classification using data compression, methods based on the ZIP format and on context tree weighting have been proposed. However, these methods are not Bayes optimal, and they use the compression ratio as the similarity measure between documents. In the Bayes coding algorithm, the weighted mixture tree obtained in the compression phase can be used as the estimated probability structure. A tree source is a class of Markov source, and the divergence between two tree sources with the same structure can be measured. However, the Bayes coding algorithm outputs different tree structures depending on the data sequence being compressed. Because the tree structures derived from different documents differ from each other, it is difficult to measure the divergence between them directly. This paper proposes a new method that converts the weighted mixture trees into a common tree structure so that the divergence can be measured. Documents can then be classified using the divergence between the trees estimated from them. The effectiveness of the proposed method is demonstrated through a simulation experiment on document classification with real data.
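The idea of bringing two trees to a common structure before comparing them can be illustrated with a minimal sketch. This is not the authors' procedure: the representation of a tree source as a dictionary of leaf contexts, the function names `expand_to_depth` and `tree_divergence`, the uniform weighting over contexts, and the toy probabilities are all assumptions made for illustration. The abstract does not specify how the weighted mixture trees are converted or which divergence is used.

```python
import math
from itertools import product


def expand_to_depth(leaves, alphabet, depth):
    """Rewrite a tree source as a full tree of the given depth.

    `leaves` maps a context tuple (most recent symbol first) to a
    next-symbol distribution. Each full-depth context inherits the
    distribution of the leaf that is a prefix of it.
    """
    full = {}
    for context in product(alphabet, repeat=depth):
        for k in range(depth + 1):
            if context[:k] in leaves:
                full[context] = leaves[context[:k]]
                break
        else:
            raise ValueError(f"no leaf covers context {context}")
    return full


def kl(p, q):
    """Kullback-Leibler divergence between two next-symbol distributions.

    Assumes q[a] > 0 wherever p[a] > 0 (e.g. smoothed estimates).
    """
    return sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0)


def tree_divergence(leaves_p, leaves_q, alphabet, depth):
    """Average per-context KL divergence after expanding both trees.

    Assumption: contexts are weighted uniformly. A divergence rate
    would weight them by the stationary context distribution instead.
    """
    full_p = expand_to_depth(leaves_p, alphabet, depth)
    full_q = expand_to_depth(leaves_q, alphabet, depth)
    return sum(kl(full_p[c], full_q[c]) for c in full_p) / len(full_p)


# Hypothetical toy trees over a binary alphabet with different structures.
tree_a = {
    ("0",): {"0": 0.9, "1": 0.1},
    ("1", "0"): {"0": 0.4, "1": 0.6},
    ("1", "1"): {"0": 0.2, "1": 0.8},
}
tree_b = {(): {"0": 0.5, "1": 0.5}}  # memoryless source (root is the only leaf)

d_ab = tree_divergence(tree_a, tree_b, "01", depth=2)
d_aa = tree_divergence(tree_a, tree_a, "01", depth=2)
```

In a classifier built along these lines, a document would be assigned to the category whose estimated tree has the smallest divergence from the document's own tree. The depth passed to `expand_to_depth` must be at least the depth of the deepest leaf in either tree.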

Journal

Journal of Japan Industrial Management Association

Journal of Japan Industrial Management Association 64 (3), 438-446, 2013

Japan Industrial Management Association

Details

CRID: 1390001205504487296

NII Article ID: 10031203385

NII Book ID: AN10561806

DOI: 10.11221/jima.64.438

ISSN: 21879079; 13422618

NDL BIB ID: 024940698

Web Site: http://id.ndl.go.jp/bib/024940698; https://ndlsearch.ndl.go.jp/books/R000000004-I024940698

Text Lang: ja

Data Source

JaLC
NDL
CiNii Articles
KAKEN

Abstract License Flag: Disallowed
