Machine Learning-Based Sentence Boundary Detection for Modern Japanese Historical Texts

白井 良介, 松村 雪桜, 小木曽 智信, 小町 守

In this study, we propose a machine learning method for detecting sentence boundaries in modern Japanese historical texts. Such texts do not always mark sentence boundaries explicitly, so boundaries have so far been annotated by hand, but manual annotation has made little headway against the enormous volume of material. Our method makes it possible to give this material a first-pass sentence boundary annotation that can support the manual annotation task. We also show that performing sentence boundary detection improves the accuracy of morphological analysis. Moreover, this is the first work to detect sentence boundaries in modern Japanese historical texts by training models on modern Japanese data and comparing multiple machine learning methods.
