Disentangling Knowledge Acquisition of LLMs through Direct Corpus Exploration
- HARAGUCHI Daichi
- NEC Data Science Laboratories
- TAMURA Takuya
- NEC Data Science Laboratories
- YANO Taro
- NEC Data Science Laboratories
- OYAMADA Masafumi
- NEC Data Science Laboratories
Bibliographic Information
- Other Title: 事前学習コーパスの直接検索による LLM の知識獲得の構造理解
Description
<p>While Large Language Models (LLMs) have demonstrated impressive knowledge acquisition during pre-training, the mechanisms of this process remain poorly understood. Previous research has established a correlation between the frequency of knowledge instances in the training corpus and the degree of knowledge acquisition. However, existing methodologies suffer from two key limitations: insufficient experimental validation of frequency, and inadequate consideration of conflicting knowledge within the training data. To address these gaps, we directly investigate the pre-training corpus to unravel the knowledge acquisition process in LLMs. Our experiments demonstrate that a higher frequency of knowledge leads to more robust knowledge acquisition. Furthermore, we find that conflicting knowledge instances within the corpus affect the degree of knowledge acquisition. Notably, our analysis suggests the existence of latent conflicts that may hinder knowledge acquisition even when no conflict is apparent at the surface level.</p>
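As a rough illustration of the two corpus-level quantities the abstract refers to (frequency of a knowledge instance and the presence of conflicting instances), the sketch below counts, over a toy corpus, how often a subject co-occurs with each candidate answer. This is not the authors' method: the substring matching, the helper names `count_knowledge` and `conflict_ratio`, and the example corpus are all assumptions made only for illustration.

```python
from collections import Counter


def count_knowledge(corpus, subject, candidates):
    """Count documents where `subject` co-occurs with each candidate object.

    Hypothetical helper: plain case-insensitive substring matching stands in
    for whatever corpus search the paper actually uses.
    """
    counts = Counter()
    for doc in corpus:
        text = doc.lower()
        if subject.lower() not in text:
            continue
        for obj in candidates:
            if obj.lower() in text:
                counts[obj] += 1
    return counts


def conflict_ratio(counts, gold):
    """Share of co-occurrences that support an object other than `gold`."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - counts[gold] / total


# Toy corpus (assumed data): one document mentions a competing answer.
corpus = [
    "Canberra is the capital of Australia.",
    "Many people think Sydney is the capital of Australia.",
    "Australia's capital, Canberra, was founded in 1913.",
    "Tokyo is the capital of Japan.",
]

counts = count_knowledge(corpus, "Australia", ["Canberra", "Sydney"])
ratio = conflict_ratio(counts, "Canberra")
print(counts, round(ratio, 3))
```

Under these assumptions, a fact with a high count and a low conflict ratio would be the kind the abstract expects to be acquired more robustly. Surface co-occurrence of this sort cannot capture the latent conflicts the abstract mentions.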
Journal
- Proceedings of the Annual Conference of JSAI
Proceedings of the Annual Conference of JSAI JSAI2025 (0), 2Win534-2Win534, 2025
The Japanese Society for Artificial Intelligence
Details
- CRID: 1390867654670628736
- ISSN: 27587347
- Text Lang: ja
- Data Source: JaLC
- Abstract License Flag: Disallowed