Extracting String Features with Adaptation for Text Classification
- Onoe Toru (Information and Computer Science, Toyohashi University of Technology)
- Hirata Katsuhiro (Information and Computer Science, Toyohashi University of Technology)
- Okabe Masayuki (Information and Media Center, Toyohashi University of Technology)
- Umemura Kyoji (Information and Computer Science, Toyohashi University of Technology)
Bibliographic Information
- Other Title
- 文字列を特徴量とし反復度を用いたテキスト分類 (Text Classification Using Strings as Features and Adaptation)
Description
Feature selection for text classification is a procedure that selects words or strings to improve classification performance. It is especially important when substrings are used as features, because the number of substrings in a given data set is usually quite large.

In this paper, we focus on substring feature selection and describe a method that uses a statistical score called "adaptation" as the selection measure. Adaptation rests on the assumption that strings appearing two or more times in a document have a high probability of being keywords; we expect this property to make it an effective tool for text classification. We compared our method with a state-of-the-art method proposed by Zhang et al., which identifies a substring feature set by removing redundant substrings that are similar in terms of statistical distribution. We evaluated the classification results by F-measure, the harmonic mean of precision and recall. An experiment on news classification demonstrated that our method outperformed Zhang's by 3.74% on average (improving Zhang's result from 79.65% to 83.39%). An experiment on spam classification demonstrated that our method outperformed Zhang's by 2.93% (improving Zhang's result from 90.23% to 93.15%). We verified that the difference between the results is statistically significant in each experiment.

A further experiment on news classification showed that our method is worse than a method using words as features by 0.49% on average, although the difference is not significant. In spam classification, our method outperformed the word-feature method by 1.04% (improving its result from 92.11% to 93.15%), and we verified that this difference is statistically significant.
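The adaptation score described above can be illustrated with a small sketch. This is not the authors' implementation; the function name, the exact formulation (the fraction of documents containing a substring in which it recurs), and the toy data are assumptions for illustration only:

```python
def adaptation(substring, documents):
    """Adaptation score of a substring over a document collection:
    among documents containing the substring at least once, the
    fraction in which it appears two or more times. Substrings that
    tend to recur within a document are likely keywords."""
    df1 = sum(1 for doc in documents if doc.count(substring) >= 1)
    df2 = sum(1 for doc in documents if doc.count(substring) >= 2)
    return df2 / df1 if df1 else 0.0

# Toy collection (hypothetical data, for illustration only).
docs = ["spam spam offer", "meeting agenda", "offer offer deal offer"]
print(adaptation("offer", docs))  # → 0.5: recurs in 1 of the 2 docs containing it
```

Under this formulation, a substring used for feature selection scores high when it habitually repeats inside the documents where it occurs, which is the intuition behind treating repeated strings as likely keywords.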
Zhang's method tends to extract substrings so short that it is difficult to recover the original phrases from which they were extracted. This degrades classification performance, because such a substring can be part of many different words, some or most of which are unrelated to the original phrase. Our method, on the other hand, avoids this pitfall because it selects substrings that contain a limited number of original words. Selecting substrings in this manner is the key advantage of our method.
Journal
- Journal of Natural Language Processing, 17 (1), 77-97, 2010
- The Association for Natural Language Processing
Details
- CRID: 1390001204476607744
- NII Article ID: 10027015990
- NII Book ID: AN10472659
- ISSN: 21858314, 13407619
- NDL BIB ID: 10576240
- Text Lang: ja
- Data Source: JaLC, NDL, Crossref, CiNii Articles
- Abstract License Flag: Disallowed