Checking and Processing External Characters (Gaiji) When Building Corpora from Printed Materials

須永 哲矢, 堤 智昭

This paper establishes a character-processing procedure for building an electronic corpus from printed editions of historical works. Digitizing printed material raises two central problems: how to represent external characters (gaiji), and which standard to follow when unifying variant character forms. We introduce a software tool that handles both problems together and define a workflow around it, so that characters can be processed uniformly regardless of the characteristics of the source text. The tool and the procedure proposed here apply not only to corpus construction but also to research on printing type. As an example of such research use, we survey the kanji types in Nihon Ryoiki (『日本霊異記』) and other volumes of Shogakukan's Shinpen Nihon Koten Bungaku Zenshu (SNKBZ) and show what proportion of them can be represented in JIS X 0213 and in Unicode, respectively.
