Automatic Bibliographic Element Extraction from OCR Text of Academic Articles Using CRF

薬師 貴之, 太田 学, 高須 淳宏

Bibliographic databases are indispensable to digital libraries of academic articles. However, extracting bibliographic elements from printed articles remains costly, even with document image-processing techniques such as optical character recognition (OCR). In this paper, we propose a method for automatically extracting bibliographic elements from OCR-processed academic articles. The method first assigns predefined bibliographic-element labels to the rectangular text blocks produced by OCR document image processing, and then, where necessary, labels the individual characters within a labeled block. This character-level labeling makes it possible to extract each author's name from a text block that lists several authors. Conditional random fields (CRF) are used for labeling both the text blocks and the characters in them. We evaluated the method on two journals, one Japanese and one English. The method correctly extracted all the predefined bibliographic text blocks from 97.56% of the Japanese articles and 97.27% of the English articles. It also correctly extracted all the author name strings from more than 99% of the Japanese author text blocks in the Japanese articles.
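The character-level labeling step can be illustrated with a small sketch of linear-chain CRF decoding. This is not the authors' implementation: the label set (`B`, `I`, `D`), the single delimiter feature, and all weights below are hypothetical and hand-set for illustration, whereas a real CRF would learn its weights from labeled author blocks. The sketch only shows how Viterbi decoding over per-character labels can split an author block into individual names.

```python
# Hypothetical label set (not taken from the paper):
#   B = first character of a name, I = inside a name, D = delimiter.
LABELS = ("B", "I", "D")
# ASCII comma, fullwidth comma, ideographic comma, space, ideographic space.
DELIMITERS = set(",，、 　")

# Hand-set weights standing in for trained CRF parameters (assumption).
START = {"B": 1.0, "I": -3.0, "D": 0.0}
TRANSITION = {
    ("B", "B"): -1.0, ("B", "I"): 1.0, ("B", "D"): 0.0,
    ("I", "B"): -3.0, ("I", "I"): 1.0, ("I", "D"): 0.0,
    ("D", "B"): 1.0, ("D", "I"): -3.0, ("D", "D"): 0.0,
}


def emission(ch, label):
    """Score of assigning `label` to `ch`, based on one delimiter feature."""
    is_delim = ch in DELIMITERS
    if label == "D":
        return 2.0 if is_delim else -2.0
    return -2.0 if is_delim else 0.0


def viterbi(chars):
    """Return the highest-scoring label sequence for a list of characters."""
    if not chars:
        return []
    score = {y: START[y] + emission(chars[0], y) for y in LABELS}
    back = []  # back[t][y] = best previous label when position t+1 has label y
    for ch in chars[1:]:
        prev, score, pointers = score, {}, {}
        for y in LABELS:
            best = max(LABELS, key=lambda p: prev[p] + TRANSITION[(p, y)])
            score[y] = prev[best] + TRANSITION[(best, y)] + emission(ch, y)
            pointers[y] = best
        back.append(pointers)
    path = [max(LABELS, key=lambda y: score[y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def extract_names(text):
    """Split an author block into names using the decoded character labels."""
    names, current = [], ""
    for ch, label in zip(text, viterbi(list(text))):
        if label == "B":
            if current:
                names.append(current)
            current = ch
        elif label == "I":
            current += ch
        else:  # delimiter closes the current name
            if current:
                names.append(current)
            current = ""
    if current:
        names.append(current)
    return names


print(extract_names("薬師貴之，太田学，高須淳宏"))
```

With these toy weights a plain delimiter split would give the same result; the CRF formulation becomes useful when trained features must handle OCR noise, affiliation marks, or inconsistent separators.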
