Data Augmentation Using Pretrained Models in Japanese Grammatical Error Correction
- Kato Hideyoshi (Marubeni Corporation)
- Okabe Masaaki (Doshisha University)
- Kitano Michiharu (JMDC Inc.)
- Yadohisa Hiroshi (Doshisha University)
Bibliographic Information
- Other Title: 日本語文法誤り訂正における事前学習済みモデルを用いたデータ増強
Abstract
Grammatical error correction (GEC) is commonly framed as a machine translation task that converts an ungrammatical sentence into a grammatical one. This task requires a large amount of parallel data consisting of pairs of ungrammatical and grammatical sentences; however, for Japanese GEC, only a few large-scale parallel datasets are available. Data augmentation (DA), which generates pseudo-parallel data, is therefore an active area of research. Most previous studies have focused on generating ungrammatical sentences rather than grammatical ones. To address this gap, this study proposes BERT-DA, a DA algorithm that generates grammatical sentences using a pre-trained BERT model. Our experiments focus on two factors: the source data and the amount of generated data, and accounting for both proved to make BERT-DA more effective. In evaluations across multiple domains, the BERT-DA model outperformed existing systems in terms of the MaxMatch and GLEU+ scores.
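The abstract describes BERT-DA only at a high level, so the snippet below is a minimal illustrative sketch of the generic idea it relies on: using a pretrained masked language model to produce fluent (grammatical) sentences that can serve as one side of pseudo-parallel data. The model name, masking rate, and the `infill_sentence` helper are assumptions for illustration via the Hugging Face transformers API, not the authors' implementation.

```python
# Illustrative sketch only: not the paper's BERT-DA procedure.
import random

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # placeholder; a Japanese BERT would be used in practice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()


def infill_sentence(sentence: str, mask_rate: float = 0.15) -> str:
    """Mask a fraction of tokens and let the masked LM re-fill them,
    producing a fluent variant that could act as the 'grammatical' side
    of a pseudo-parallel pair."""
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"].clone()
    # Candidate positions exclude the [CLS] and [SEP] special tokens.
    candidates = list(range(1, input_ids.shape[1] - 1))
    n_mask = max(1, int(len(candidates) * mask_rate))
    for pos in random.sample(candidates, n_mask):
        input_ids[0, pos] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(input_ids=input_ids,
                       attention_mask=enc["attention_mask"]).logits
    # Replace every masked position with the model's top-1 prediction.
    mask_positions = (input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    for pos in mask_positions:
        input_ids[0, pos] = int(logits[0, pos].argmax())
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    # Example: generate a fluent variant of a seed sentence.
    print(infill_sentence("私は昨日、図書館で本を借りました。"))
```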
Journal
- Transactions of the Japanese Society for Artificial Intelligence, 38 (4), A-L41_1-10, 2023-07-01
- Publisher: The Japanese Society for Artificial Intelligence
Details
- CRID: 1390015191520703488
- ISSN: 13468030, 13460714
- Text Lang: ja
- Data Source: JaLC, Crossref
- Abstract License Flag: Disallowed