An Empirical Comparison and Ensemble Learning Methods of BERT Models on Authorship Attribution

  • KANDA Taisei
    Graduate School of Culture and Information Science, Doshisha University
  • JIN Mingzhe
    Research Center for Linguistic Ecology, Doshisha University / Institute of Interdisciplinary Research, Kyoto University of Advanced Science

Bibliographic Information

Other Title
  • 著者推定におけるBERTの比較分析とアンサンブル学習

Description

Bidirectional Encoder Representations from Transformers (BERT) is a general-purpose language model designed to be pre-trained on a large amount of data, fine-tuned, and then adapted to tasks in individual fields. Japanese BERT models have been released that were pre-trained on relatively easy-to-obtain corpora such as Wikipedia, Aozora Bunko, and Japanese business news articles. In this study, we compared the performance of multiple BERT models built from different pre-training data on an authorship attribution task and analyzed how the pre-training data affects individual tasks. We also investigated methods for improving the accuracy of authorship attribution models through ensemble learning with multiple BERT models. We found that a BERT model pre-trained on the Aozora Bunko corpus performed well when attributing the authorship of Aozora Bunko texts, which clearly shows that the pre-training data affects a model's performance on individual tasks. We also found that an ensemble learning architecture combining multiple BERT models outperformed any single model.
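The abstract does not specify how the models were combined, so the sketch below assumes a simple soft-voting ensemble: each fine-tuned BERT classifier produces author probabilities for a text, and the averaged probabilities decide the prediction. The checkpoint paths and the number of candidate authors are hypothetical placeholders, not the authors' actual models or configuration.

```python
# Minimal soft-voting ensemble sketch for authorship attribution with
# several fine-tuned Japanese BERT classifiers (Hugging Face Transformers).
# Checkpoint paths and NUM_AUTHORS are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINTS = [
    "path/to/wikipedia-bert-finetuned",  # placeholder: BERT pre-trained on Wikipedia
    "path/to/aozora-bert-finetuned",     # placeholder: BERT pre-trained on Aozora Bunko
    "path/to/news-bert-finetuned",       # placeholder: BERT pre-trained on news articles
]
NUM_AUTHORS = 10  # placeholder number of candidate authors

def ensemble_predict(text: str) -> int:
    """Average the author probabilities of every model and return the argmax."""
    probs = torch.zeros(NUM_AUTHORS)
    for ckpt in CHECKPOINTS:
        tokenizer = AutoTokenizer.from_pretrained(ckpt)
        model = AutoModelForSequenceClassification.from_pretrained(
            ckpt, num_labels=NUM_AUTHORS
        )
        model.eval()
        inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        probs += torch.softmax(logits, dim=-1).squeeze(0)
    probs /= len(CHECKPOINTS)
    return int(probs.argmax())
```

Averaging probabilities (rather than majority voting over hard labels) lets a confident model outweigh uncertain ones; either scheme would fit the ensemble architecture described in the abstract.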

Journal

  • Joho Chishiki Gakkaishi 34 (3), 244-255, 2024-09-30

    Japan Society of Information and Knowledge
