Automatic Detection of Scientific Papers Based on Their Structure and Elements

Bibliographic Information

Other Title
  • 構造と構成要素に基づく学術論文の自動判定
  • コウゾウ ト コウセイ ヨウソ ニ モトズク ガクジュツ ロンブン ノ ジドウ ハンテイ

Abstract

In this paper, we develop rules for the automatic detection of scientific papers among PDF files on the Web. We inspected the structure and elements of scientific papers and observed that they typically contain certain basic elements and follow the IMRAD (Introduction, Methods, Results, and Discussion) format. We examined 1,172 scientific papers on the Web. The results indicate that the papers share common elements such as title, authors, keywords, and references, and that 40% of the papers, which have an explicit structure, follow an IMRAD or similar format. Using the information on structure and elements obtained from this inspection, we develop rules for the automatic detection of scientific papers. The rules are evaluated on English and Japanese PDF collections, each consisting of 20,000 files compiled by random sampling from the Web. A random forest classifier is applied, yielding an F-value of 0.74 for the English PDF files and 0.53 for the Japanese PDF files. These results indicate that rules developed with the approach given in this study can detect scientific papers among PDF files on the Web.
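The abstract does not list the rules themselves, so the following is only a minimal sketch of what structure-and-element features and an F-value computation might look like. It assumes the PDF text has already been extracted to a plain string. The heading patterns and the names `extract_features` and `f_measure` are hypothetical illustrations, not the authors' rules. The resulting feature dictionaries could be passed to a classifier such as a random forest, which is omitted here.

```python
import re

# Hypothetical heading patterns (assumption): the paper's actual rules
# are not given in the abstract.
IMRAD_HEADINGS = {
    "introduction": r"^\s*(\d+\.?\s*)?introduction\s*$",
    "methods": r"^\s*(\d+\.?\s*)?(methods?|methodology|materials and methods)\s*$",
    "results": r"^\s*(\d+\.?\s*)?results?( and discussion)?\s*$",
    "discussion": r"^\s*(\d+\.?\s*)?(discussion|conclusions?)\s*$",
}


def extract_features(text):
    """Return boolean features describing which elements appear in the text."""
    lines = text.splitlines()

    def has(pattern):
        # True if any line matches the pattern from its start, ignoring case.
        return any(re.match(pattern, line, re.IGNORECASE) for line in lines)

    features = {name: has(pattern) for name, pattern in IMRAD_HEADINGS.items()}
    features["abstract"] = has(r"^\s*abstract\b")
    features["keywords"] = has(r"^\s*key\s?words?\b")
    features["references"] = has(r"^\s*(\d+\.?\s*)?(references|bibliography)\s*$")
    return features


def f_measure(predicted, actual):
    """Harmonic mean of precision and recall for boolean labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Line-anchored patterns are used so that a word such as "results" in running text is not mistaken for a section heading. Real PDF extraction output is noisier than this sketch assumes.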
