An Empirical Comparison and Ensemble Learning Methods of BERT Models on Authorship Attribution
-
- KANDA Taisei
- Graduate School of Culture and Information Science, Doshisha University
-
- JIN Mingzhe
- Research Center for Linguistic Ecology, Doshisha University / Institute of Interdisciplinary Research, Kyoto University of Advanced Science
Bibliographic Information
- Other Title
-
- 著者推定におけるBERTの比較分析とアンサンブル学習 (Comparative Analysis of BERT Models and Ensemble Learning for Authorship Attribution)
Description
Bidirectional Encoder Representations from Transformers (BERT) is a general-purpose language model designed to be pre-trained on a large amount of training data, fine-tuned, and then adapted to tasks in individual fields. Japanese BERT models have been released based on training data that is relatively easy to obtain, such as Wikipedia, Aozora Bunko, and Japanese business news articles. In this study, we compared the performance of multiple BERT models built from different pre-training data on an authorship attribution task and analyzed the impact of the pre-training data on individual tasks. We also studied methods for improving the accuracy of authorship attribution models through ensemble learning with multiple BERT models. As a result, we found that a BERT model trained on the Aozora Bunko corpus performed well at attributing the authors of Aozora Bunko texts, which clearly shows that the pre-training data affects a model's performance on individual tasks. We also found that an ensemble learning architecture comprising multiple BERT models outperformed any single model.
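The abstract describes two techniques: fine-tuning pre-trained Japanese BERT models for authorship classification, and combining several such models by ensemble learning. The paper itself does not publish code; the following is a minimal sketch of the ensemble idea using the Hugging Face transformers library, with soft voting (averaging softmax probabilities) standing in for whatever ensemble architecture the authors actually used. Only cl-tohoku/bert-base-japanese is a real checkpoint name; the other two paths are hypothetical placeholders for models pre-trained on Aozora Bunko and news articles.

```python
# Minimal sketch: soft-voting ensemble of fine-tuned Japanese BERT models
# for authorship attribution. Assumes each model has already been
# fine-tuned as a sequence classifier over the same set of author labels.
# (The Tohoku tokenizer additionally requires the fugashi/ipadic packages.)
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_PATHS = [
    "cl-tohoku/bert-base-japanese",  # Wikipedia pre-training (real checkpoint)
    "./bert-aozora-finetuned",       # hypothetical: Aozora Bunko pre-training
    "./bert-news-finetuned",         # hypothetical: news-article pre-training
]

def ensemble_predict(text: str, num_authors: int) -> int:
    """Soft voting: average class probabilities from each fine-tuned model."""
    probs = torch.zeros(num_authors)
    for path in MODEL_PATHS:
        tokenizer = AutoTokenizer.from_pretrained(path)
        model = AutoModelForSequenceClassification.from_pretrained(
            path, num_labels=num_authors
        )
        model.eval()
        inputs = tokenizer(
            text, truncation=True, max_length=512, return_tensors="pt"
        )
        with torch.no_grad():
            logits = model(**inputs).logits
        probs += torch.softmax(logits, dim=-1).squeeze(0)
    # Averaging does not change the argmax, but keeps probs interpretable.
    return int(torch.argmax(probs / len(MODEL_PATHS)))

# Usage example: predict the author index of an Aozora Bunko passage.
# predicted = ensemble_predict("吾輩は猫である。名前はまだ無い。", num_authors=10)
```

Soft voting is only one possible combination rule; the paper's ensemble architecture may differ (e.g. majority voting or a learned meta-classifier), and a faithful reproduction would follow the method described in the full text.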
Journal
-
- Joho Chishiki Gakkaishi
-
Joho Chishiki Gakkaishi 34 (3), 244-255, 2024-09-30
Japan Society of Information and Knowledge
Details
-
- CRID
- 1390583647822855552
-
- ISSN
- 1881-7661
- 0917-1436
-
- Text Lang
- ja
-
- Data Source
-
- JaLC
- Crossref
-
- Abstract License Flag
- Disallowed