Effect of Singular Value Decomposition and Weighting by Singular Value of Document-Term Matrix, for Large-scale Data Perspective and Targeted Data Extraction
- Hirano Mariko (PASONA Inc.)
- Kobayakawa Takeshi S. (NHK Science & Technology Research Laboratories)
Bibliographic Information
- Other Title
- 大規模データの俯瞰とターゲットデータの抽出に対する文書‐単語行列の特異値分解と特異値による重みづけの有効性
- ダイキボ データ ノ フカン ト ターゲットデータ ノ チュウシュツ ニ タイスル ブンショ-タンゴ ギョウレツ ノ トクイチ ブンカイ ト トクイチ ニ ヨル オモミ ズケ ノ ユウコウセイ
Abstract
We analyzed tweets posted within four days after the Great East Japan Earthquake, provided through Project 311. After obtaining an overview of the data by clustering the tweets, we defined a set of target categories for extraction and built a tweet extractor tailored to those targets. In this sequence of steps, improving the clustering used to discover the target categories is very important. We propose a method that uses the singular values as feature weights, whereas the well-known conventional use of singular value decomposition is limited to dimensionality reduction. We also propose an evaluation criterion for human-aided clustering, and we conducted experiments comparing this criterion and commonly used ones against the time people actually spent on the task. The experiments show that the proposed weighting method is effective and that our criterion is adequate, mainly in terms of the time efficiency of the task. For the targeted data extraction task, which is a classification problem, we observed some improvement in accuracy even though the training process itself already involves a weighting mechanism.
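The contrast the abstract draws between using SVD only for dimensionality reduction and additionally weighting features by the singular values can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `svd_features`, the `weight_power` parameter, and the toy matrix are all assumptions made here, and the abstract does not state the exact form of the weighting.

```python
import numpy as np


def svd_features(doc_term, k, weight_power=1.0):
    """Return k-dimensional document features from a truncated SVD.

    weight_power=0.0 keeps the plain left singular vectors (dimension
    reduction only); weight_power=1.0 scales each dimension by its
    singular value. The exponent is a hypothetical knob for illustration,
    not a parameter taken from the paper.
    """
    matrix = np.asarray(doc_term, dtype=float)
    u, s, _vt = np.linalg.svd(matrix, full_matrices=False)
    # Broadcasting multiplies column j of U by s[j] ** weight_power.
    return u[:, :k] * (s[:k] ** weight_power)


# Toy document-term count matrix: 4 documents x 5 terms (made-up data).
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
])

unweighted = svd_features(X, k=2, weight_power=0.0)
weighted = svd_features(X, k=2, weight_power=1.0)
```

In the unweighted variant every retained dimension has unit norm, so a distance-based clustering treats all of them equally; in the weighted variant, dimensions with larger singular values contribute more to distances between documents.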
Journal
- Journal of Natural Language Processing
Journal of Natural Language Processing 20 (3), 335-365, 2013
The Association for Natural Language Processing
Details
- CRID: 1390001204474527360
- NII Article ID: 10031174537
- NII Book ID: AN10472659
- ISSN: 21858314, 13407619
- NDL BIB ID: 024763558
- Text Lang: ja
- Data Source: JaLC, NDL, Crossref, CiNii Articles
- Abstract License Flag: Disallowed