An Empirical Comparison and Ensemble Learning Methods of BERT Models on Authorship Attribution
-
- KANDA Taisei
- Graduate School of Culture and Information Science, Doshisha University
-
- JIN Mingzhe
- Research Center for Linguistic Ecology, Doshisha University / Institute of Interdisciplinary Research, Kyoto University of Advanced Science
Bibliographic Information
- Other Title
-
- 著者推定におけるBERTの比較分析とアンサンブル学習 (Comparative Analysis of BERT Models and Ensemble Learning for Authorship Attribution)
Description
Bidirectional Encoder Representations from Transformers (BERT) is a general-purpose language model designed to be pre-trained on a large amount of training data, fine-tuned, and then adapted to tasks in individual fields. Japanese BERT models have been released based on training data that is relatively easy to obtain, such as Wikipedia, Aozora Bunko, and Japanese business news articles. In this study, we compared the performance of multiple BERT models built from different pre-training data on an authorship attribution task and analyzed the impact of the pre-training data on individual tasks. We also studied methods for improving the accuracy of authorship attribution models through ensemble learning with multiple BERT models. As a result, we found that a BERT model trained on the Aozora Bunko corpus performed well at attributing the authors of Aozora Bunko texts, which clearly shows that the pre-training data affects a model's performance on individual tasks. We also found that an ensemble learning architecture comprising multiple BERT models outperformed any single model.
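The abstract describes two techniques: fine-tuning pre-trained Japanese BERT models for authorship classification, and combining several such models by ensemble learning. The paper itself does not publish code; the following is a minimal sketch of the ensemble idea using the Hugging Face transformers library, with soft voting (averaging softmax probabilities) standing in for whatever ensemble architecture the authors actually used. Only cl-tohoku/bert-base-japanese is a real checkpoint name; the other two paths are hypothetical placeholders for models pre-trained on Aozora Bunko and news articles.

```python
# Minimal sketch: soft-voting ensemble of fine-tuned Japanese BERT models
# for authorship attribution. Assumes each model has already been
# fine-tuned as a sequence classifier over the same set of author labels.
# (The Tohoku tokenizer additionally requires the fugashi/ipadic packages.)
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_PATHS = [
    "cl-tohoku/bert-base-japanese",  # Wikipedia pre-training (real checkpoint)
    "./bert-aozora-finetuned",       # hypothetical: Aozora Bunko pre-training
    "./bert-news-finetuned",         # hypothetical: news-article pre-training
]

def ensemble_predict(text: str, num_authors: int) -> int:
    """Soft voting: average class probabilities from each fine-tuned model."""
    probs = torch.zeros(num_authors)
    for path in MODEL_PATHS:
        tokenizer = AutoTokenizer.from_pretrained(path)
        model = AutoModelForSequenceClassification.from_pretrained(
            path, num_labels=num_authors
        )
        model.eval()
        inputs = tokenizer(
            text, truncation=True, max_length=512, return_tensors="pt"
        )
        with torch.no_grad():
            logits = model(**inputs).logits
        probs += torch.softmax(logits, dim=-1).squeeze(0)
    # Averaging does not change the argmax, but keeps probs interpretable.
    return int(torch.argmax(probs / len(MODEL_PATHS)))

# Usage example: predict the author index of an Aozora Bunko passage.
# predicted = ensemble_predict("吾輩は猫である。名前はまだ無い。", num_authors=10)
```

Soft voting is only one possible combination rule; the paper's ensemble architecture may differ (e.g. majority voting or a learned meta-classifier), and a faithful reproduction would follow the method described in the full text.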
Journal
-
- Joho Chishiki Gakkaishi
-
Joho Chishiki Gakkaishi 34 (3), 244-255, 2024-09-30
Japan Society of Information and Knowledge
Details
-
- CRID
- 1390583647822855552
-
- ISSN
- 1881-7661
- 0917-1436
-
- Text Lang
- ja
-
- Data Source
-
- JaLC
- Crossref
-
- Abstract License Flag
- Disallowed