Creating Chinese-English Comparable Corpora

HUANG Degen, WANG Shanshan, REN Fuji

doi:10.1587/transinf.e96.d.1853

Abstract

Comparable corpora are valuable resources for many NLP applications, and extensive research has been done in recent years on information mining based on them. Because few large-scale comparable corpora are publicly available at present, this paper presents a bi-directional CLIR-based (cross-language information retrieval) method for creating comparable corpora from two independent news collections in different languages. The original Chinese and English document collections are both crawled from XinHuaNet and formatted in a consistent manner. For each document in the two collections, the best query keywords are extracted to represent the essential content of the document, and the keywords are then translated into the language of the other collection. The translated queries are run against the collection in the same language to retrieve candidate documents in the other language, and the candidates are aligned based on their publication dates and similarity scores. Results show that our approach significantly outperforms previous approaches to the construction of Chinese-English comparable corpora.
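The pipeline described in the abstract (keyword extraction, query translation, retrieval, then alignment by publication date and similarity) can be illustrated with a small Python sketch. This is not the authors' implementation. Everything specific in it is an assumption made for illustration: TF-IDF as the keyword selector, a toy bilingual lexicon as the translator, term overlap as the similarity score, the `max_days` and `min_sim` thresholds, keeping only pairs found in both directions, and all names (`Doc`, `top_keywords`, `find_pairs`, `align`) and sample documents.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
import math


@dataclass(frozen=True)
class Doc:
    doc_id: str
    published: date
    tokens: tuple  # pre-tokenized text (tokenization is out of scope here)


def top_keywords(doc, collection, k=3):
    """Pick the k highest TF-IDF tokens of doc (assumed keyword selector)."""
    n = len(collection)
    df = Counter(t for d in collection for t in set(d.tokens))
    tf = Counter(doc.tokens)
    score = {t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1) for t in tf}
    # Sort by descending score, then by token so the result is deterministic.
    return sorted(score, key=lambda t: (-score[t], t))[:k]


def translate(keywords, lexicon):
    """Toy dictionary lookup; untranslatable keywords are dropped."""
    return [lexicon[t] for t in keywords if t in lexicon]


def overlap(query, doc):
    """Fraction of distinct query terms that occur in the document."""
    terms = set(query)
    return len(terms & set(doc.tokens)) / len(terms) if terms else 0.0


def find_pairs(sources, targets, lexicon, max_days=1, min_sim=0.5):
    """For each source doc, keep the best target within the date window."""
    pairs = set()
    for src in sources:
        query = translate(top_keywords(src, sources), lexicon)
        candidates = [
            (overlap(query, tgt), tgt.doc_id)
            for tgt in targets
            if abs((tgt.published - src.published).days) <= max_days
        ]
        candidates = [c for c in candidates if c[0] >= min_sim]
        if candidates:
            pairs.add((src.doc_id, max(candidates)[1]))
    return pairs


def align(zh_docs, en_docs, zh_en, en_zh):
    """Keep pairs retrieved in both directions (an assumed combination rule)."""
    forward = find_pairs(zh_docs, en_docs, zh_en)
    backward = {(z, e) for e, z in find_pairs(en_docs, zh_docs, en_zh)}
    return forward & backward


# Hypothetical toy data, invented for this sketch.
zh_docs = [
    Doc("zh1", date(2013, 1, 1), ("地震", "救援", "四川", "的")),
    Doc("zh2", date(2013, 1, 2), ("股市", "上涨", "银行", "的")),
]
en_docs = [
    Doc("en1", date(2013, 1, 1), ("earthquake", "rescue", "sichuan", "the")),
    Doc("en2", date(2013, 1, 3), ("stock", "market", "bank", "the")),
]
zh_en = {"地震": "earthquake", "救援": "rescue", "四川": "sichuan",
         "股市": "stock", "上涨": "rise", "银行": "bank"}
en_zh = {"earthquake": "地震", "rescue": "救援", "sichuan": "四川",
         "stock": "股市", "market": "市场", "bank": "银行"}

pairs = align(zh_docs, en_docs, zh_en, en_zh)
print(sorted(pairs))
```

Requiring agreement in both directions is a deliberately strict choice that trades recall for precision; a union of the two directions, or a weighted combination of the two scores, would be an equally plausible reading of "bi-directional".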

Journal

IEICE Transactions on Information and Systems

IEICE Transactions on Information and Systems E96.D (8), 1853-1861, 2013

The Institute of Electronics, Information and Communication Engineers (IEICE)

Details

CRID: 1390282679354705024

NII Article ID: 130003370967

DOI: 10.1587/transinf.e96.d.1853

ISSN: 17451361; 09168532

Web Site: https://www.jstage.jst.go.jp/article/transinf/E96.D/8/E96.D_1853/_pdf

Text language code: en

Data source type

JaLC
Crossref
CiNii Articles

Abstract license flag: Disallowed
