NDLTableSet: Construction of a dataset for structuring table areas in digitized materials, and investigation of machine learning methods

Toru Aoike (青池 亨)

To reanalyze the data in table areas within images of digitized materials, or to visualize it as graphs, it is not enough to convert the numerical values and other contents of the table into text using OCR (optical character recognition). The table must also be structured according to the positional relationships of its rows and columns and the merge relationships among its cells. Research on table structuring in the machine learning field has been actively conducted on born-digital resources such as PDFs of academic papers, whereas very few public datasets are available for scanned, non-born-digital materials, which has made them difficult to study. To address this problem and to make usable the table data in images of digitized materials held by the National Diet Library whose copyright protection period has expired, this study constructed a dataset for structuring tables from images of digitized materials and used it to develop a machine learning model that infers table structure, thereby examining whether the structuring process can be automated.
