Data extraction method from patents with small amount of training data for data-driven materials design

TSUYUKI Masafumi, AGATSUMA Shotaro, MUTO Kazuo

doi:10.11517/pjsai.jsai2024.0_3xin257


TSUYUKI Masafumi (Hitachi, Ltd.)
AGATSUMA Shotaro (Hitachi, Ltd.)
MUTO Kazuo (Hitachi, Ltd.)

Bibliographic Information

Other Title

データ駆動型材料開発に向けた少量の学習データによる特許からの実験データ抽出技術

Description

For data-driven materials design, it is important to construct a database by extracting experimental results from the literature. The challenge is to speed up the customization of machine learning models for information extraction. In this study, we focused on large language models (LLMs) such as GPT-4, which can perform various tasks without additional training data. For the evaluation, we used the ChEMU2020 dataset, which targets information extraction from patents related to chemical experiments. GPT-4 achieved a high F1 score of 0.61 even in the zero-shot setting, but struggled with extraction that requires domain knowledge, such as identifying the "catalyst." Fine-tuning SciBERT, a model specialized for scientific papers, with low-rank adaptation (LoRA) raised the F1 score to 0.71 even with a small amount of training data. These results suggest that model development can be sped up by correcting LLM output to produce a small amount of training data and then fine-tuning a domain-specific model on it.
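The F1 scores quoted above (0.61 and 0.71) compare extracted entities against gold annotations. As a rough illustration only, the sketch below computes a micro-averaged F1 under exact span-and-label matching. The matching rule, the `Entity` tuple layout, the `micro_f1` helper, and the label strings are all assumptions made for this example; the abstract does not state which evaluation protocol or label set the authors used.

```python
from typing import Iterable, Set, Tuple

# Hypothetical entity layout: (document id, start offset, end offset, label).
Entity = Tuple[str, int, int, str]


def micro_f1(gold: Iterable[Entity], predicted: Iterable[Entity]) -> float:
    """Micro-averaged F1, assuming exact span-and-label matching."""
    gold_set: Set[Entity] = set(gold)
    pred_set: Set[Entity] = set(predicted)
    true_positives = len(gold_set & pred_set)
    if true_positives == 0:
        # Covers empty inputs and the no-overlap case without dividing by zero.
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Illustrative data; the label names are placeholders, not the dataset's own.
gold = [
    ("doc1", 0, 7, "PRODUCT"),
    ("doc1", 20, 28, "SOLVENT"),
    ("doc1", 35, 44, "CATALYST"),
]
predicted = [
    ("doc1", 0, 7, "PRODUCT"),
    ("doc1", 20, 28, "SOLVENT"),
    ("doc1", 35, 44, "REAGENT"),  # correct span, wrong label: counted as an error
]
score = micro_f1(gold, predicted)
```

In this toy case two of three predictions match, so precision and recall are both 2/3. A mislabeled "catalyst" span lowers both at once, which is one way the domain-knowledge errors mentioned in the abstract would show up in the score.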

Journal

Proceedings of the Annual Conference of JSAI JSAI2024 (0), 3Xin257-3Xin257, 2024

The Japanese Society for Artificial Intelligence

Details

CRID: 1390018971042401408
DOI: 10.11517/pjsai.jsai2024.0_3xin257
ISSN: 2758-7347
Text Lang: ja
Data Source: JaLC
Abstract License Flag: Disallowed
