Corpus Filtering Focusing on Named Entities for Neural Machine Translation
- HOMMA Hiroki (Tokyo Metropolitan University)
- YAMAGISHI Hayahide (Tokyo Metropolitan University)
- MATSUMURA Yukio (Tokyo Metropolitan University)
- KOMACHI Mamoru (Tokyo Metropolitan University)
Bibliographic Information
- Other Title: ニューラル機械翻訳におけるコーパスフィルタリングに関する固有表現に注目した分析 (An Analysis of Corpus Filtering in Neural Machine Translation Focusing on Named Entities)
Abstract
Some parallel corpora include sentences that disturb the training of machine translation systems. Removing such noisy sentences from the training corpus, for example those containing many out-of-vocabulary words, is expected to yield better translations. In this paper, we focus on sentences containing named entities, because most named entities become out-of-vocabulary words due to their low frequency. We propose two filtering methods, one using byte pair encoding and one using named entity recognition. Removing noisy sentences from a Japanese-English training corpus improves BLEU scores by 0.5 points with both proposed methods, a statistically significant gain. Our analysis shows that both methods reduce errors in noun suffixes, determiners, and sentence length.
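The abstract does not state the filtering criteria, so the following is only an illustrative sketch of the general idea of dropping sentence pairs with many out-of-vocabulary tokens. It is not the authors' byte-pair-encoding or named-entity-recognition method. The function names, the frequency cutoff `min_freq`, the threshold `max_oov_ratio`, and the assumption that both sides are already whitespace-tokenized (Japanese would need a word segmenter first) are all hypothetical.

```python
from collections import Counter


def build_vocab(sentences, min_freq=2):
    """Return the set of tokens occurring at least `min_freq` times (assumed cutoff)."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return {tok for tok, count in counts.items() if count >= min_freq}


def oov_ratio(sentence, vocab):
    """Fraction of whitespace-separated tokens in `sentence` that are not in `vocab`."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    return sum(tok not in vocab for tok in tokens) / len(tokens)


def filter_corpus(pairs, src_vocab, tgt_vocab, max_oov_ratio=0.3):
    """Keep (source, target) pairs whose OOV ratio is within the threshold on both sides."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if oov_ratio(src, src_vocab) <= max_oov_ratio
        and oov_ratio(tgt, tgt_vocab) <= max_oov_ratio
    ]
```

In such a setup, the vocabularies would typically be built from the training corpus itself, and the threshold would be tuned on held-out data; the values above are placeholders.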
Journal
- Proceedings of the Annual Conference of JSAI
Proceedings of the Annual Conference of JSAI JSAI2019 (0), 4Rin106-4Rin106, 2019
The Japanese Society for Artificial Intelligence
Details
- CRID: 1390282763120938624
- NII Article ID: 130007658892
- Text Lang: ja
- Data Source: JaLC, CiNii Articles
- Abstract License Flag: Disallowed