Automatic Detection of Scientific Papers Based on Their Structure and Elements

ISHITA Emi, AGATA Teru, MIYATA Yosuke, IKEUCHI Atsushi, UEDA Shuichi

doi:10.20651/jslis.60.1_18

Bibliographic Information

Other Title

構造と構成要素に基づく学術論文の自動判定
コウゾウトコウセイヨウソニモトズクガクジュツロンブンノジドウハンテイ

Description

In this paper, we develop rules for the automatic detection of scientific papers among PDF files on the Web. We inspected the structure and elements of scientific papers and observed that they typically contain certain basic elements and follow an IMRAD (Introduction, Methods, Results, and Discussion) format. We examined 1,172 scientific papers on the Web. The results indicate that the papers share common elements such as title, authors, keywords, and references, and that 40% of the papers, which have an explicit structure, have an IMRAD or similar format. Using the structural and element information obtained from this inspection, we developed rules for the automatic detection of scientific papers. The rules were evaluated on English and Japanese PDF collections, each consisting of 20,000 files compiled by random sampling from the Web. A random forest classifier was applied, yielding an F-value of 0.74 for English PDF files and 0.53 for Japanese PDF files. These results indicate that rules developed with this approach can detect scientific papers among PDF files on the Web.
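
The approach described above can be illustrated with a minimal sketch, assuming the text has already been extracted from a PDF. The patterns in `ELEMENT_PATTERNS`, the function names `extract_features` and `f_measure`, and the IMRAD check are all hypothetical illustrations; the abstract does not state the actual rules, so this is not the authors' implementation. The sketch shows how the presence of elements and section headings might be turned into binary features, and how an F-value is computed from true and predicted labels.

```python
import re

# Hypothetical heading patterns (assumptions for illustration only;
# the paper's actual detection rules are not given in the abstract).
ELEMENT_PATTERNS = {
    "abstract": r"^\s*abstract\b",
    "keywords": r"^\s*key\s?words?\b",
    "introduction": r"^\s*(\d+\.?\s*)?introduction\b",
    "methods": r"^\s*(\d+\.?\s*)?(materials and )?methods?\b",
    "results": r"^\s*(\d+\.?\s*)?results?\b",
    "discussion": r"^\s*(\d+\.?\s*)?(discussion|conclusions?)\b",
    "references": r"^\s*(references|bibliography)\b",
}

IMRAD_SECTIONS = ("introduction", "methods", "results", "discussion")


def extract_features(text):
    """Return 0/1 flags for each element, plus a flag for a full IMRAD structure."""
    features = {}
    for name, pattern in ELEMENT_PATTERNS.items():
        found = re.search(pattern, text, flags=re.IGNORECASE | re.MULTILINE)
        features[name] = 1 if found else 0
    # IMRAD flag: set only when all four section headings are present.
    features["imrad"] = 1 if all(features[s] for s in IMRAD_SECTIONS) else 0
    return features


def f_measure(true_labels, predicted_labels):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Flags of this kind could serve as input to a classifier such as a random forest, with the F-value then computed on held-out labeled files; the specific features and thresholds would need to be derived from an inspection of real papers, as the study describes.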

Journal

Journal of Japan Society of Library and Information Science

Journal of Japan Society of Library and Information Science 60 (1), 18-34, 2014

Japan Society of Library and Information Science