Extracting String Features with Adaptation for Text Classification

  • Onoe Toru
    Information and Computer Science, Toyohashi University of Technology
  • Hirata Katsuhiro
    Information and Computer Science, Toyohashi University of Technology
  • Okabe Masayuki
    Information and Media Center, Toyohashi University of Technology
  • Umemura Kyoji
    Information and Computer Science, Toyohashi University of Technology

Bibliographic Information

Other Title
  • 文字列を特徴量とし反復度を用いたテキスト分類 (Text Classification Using Strings as Features and Adaptation)

Description

Feature selection for text classification is a procedure that selects words or strings so as to improve classification performance. This operation is especially important when substrings are used as features, because the number of substrings in a given data set is usually quite large.

In this paper, we focus on substring feature selection and describe a method that uses a statistical score called “adaptation” as the selection measure. Adaptation rests on the assumption that a string appearing more than twice in a document has a high probability of being a keyword; we expect this property to make it an effective feature for text classification. We compared our method with a state-of-the-art method proposed by Zhang et al., which builds a substring feature set by removing redundant substrings that are similar in terms of their statistical distribution. We evaluate classification results by the F-measure, the harmonic mean of precision and recall. An experiment on news classification demonstrated that our method outperformed Zhang’s by 3.74% on average (improving Zhang’s result from 79.65% to 83.39%). In addition, an experiment on spam classification demonstrated that our method outperformed Zhang’s by 2.93% (improving Zhang’s result from 90.23% to 93.15%). We verified that the difference between the results is statistically significant in each experiment.

The news classification experiment also shows that our method is, on average, 0.49% worse than a method that uses words as features, although this difference is not significant. In the spam classification experiment, however, our method outperformed the word-based method by 1.04% (improving its result from 92.11% to 93.15%), and we verified that this difference is statistically significant.
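The abstract's two quantitative ingredients can be sketched in code. A minimal reading of the adaptation assumption, following Church-style adaptation statistics, is the estimate P(string appears at least twice in a document | it appears at least once), i.e. the ratio of documents containing a string twice to documents containing it once; the exact estimator used in the paper may differ, so treat this as an illustrative approximation. The F-measure is the standard harmonic mean of precision and recall.

```python
def adaptation(docs, s):
    """Approximate adaptation of substring s over a document collection:
    P(s appears >= 2 times in a doc | s appears >= 1 time),
    estimated as df2(s) / df1(s).  Illustrative sketch only; the paper's
    exact estimator may differ."""
    df1 = sum(1 for d in docs if d.count(s) >= 1)  # docs containing s at least once
    df2 = sum(1 for d in docs if d.count(s) >= 2)  # docs containing s at least twice
    return df2 / df1 if df1 else 0.0

def f_measure(precision, recall):
    """F-measure: harmonic mean of precision and recall."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Toy collection (hypothetical data for illustration):
docs = ["spam spam offer", "hello world", "offer now offer"]
print(adaptation(docs, "offer"))   # appears in 2 docs, twice in 1 of them -> 0.5
print(f_measure(0.8, 0.9))
```

A substring with high adaptation repeats within the documents where it occurs, which is the behavior the paper associates with keywords and hence with useful classification features.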
Zhang’s method tends to extract substrings so short that it is difficult to recognize the original phrases from which they were extracted. This degrades classification performance, because such a short substring can be part of many different words, some or most of which are unrelated to the phrase it came from. Our method, on the other hand, avoids this pitfall because it selects substrings that contain a limited number of original words. Selecting substrings in this manner is the key advantage of our method.


References (13)
