Disentangling Knowledge Acquisition of LLMs through Direct Corpus Exploration

HARAGUCHI Daichi, TAMURA Takuya, YANO Taro, OYAMADA Masafumi

doi:10.11517/pjsai.jsai2025.0_2win534

Bibliographic Information

Other Title

事前学習コーパスの直接検索による LLM の知識獲得の構造理解

Description

While Large Language Models (LLMs) have demonstrated impressive knowledge acquisition during pre-training, the mechanisms of this process remain poorly understood. Previous research has established a correlation between the frequency of knowledge instances in the training corpus and the degree of knowledge acquisition. However, existing methodologies suffer from two key limitations: insufficient experimental validation of frequency, and inadequate consideration of conflicting knowledge within the training data. To address these gaps, we conduct a direct investigation of the pre-training corpus to unravel the knowledge acquisition process in LLMs. Our experiments demonstrate that a higher frequency of knowledge leads to more robust knowledge acquisition. Furthermore, we discover that conflicting knowledge instances within the corpus impact the degree of knowledge acquisition. Notably, our analysis suggests the existence of latent conflicts that may hinder knowledge acquisition even in cases where conflicts are not immediately apparent at the surface level.

Journal

Proceedings of the Annual Conference of JSAI

Proceedings of the Annual Conference of JSAI JSAI2025 (0), 2Win534-2Win534, 2025

The Japanese Society for Artificial Intelligence

Details

CRID: 1390867654670628736

DOI: 10.11517/pjsai.jsai2025.0_2win534

ISSN: 2758-7347

Text Lang: ja

Data Source

JaLC

Abstract License Flag: Disallowed
