A Quantitative Analysis of Japanese Family Names

梅田, 三千雄

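To clarify the various properties of Japanese family names, we built a family-name database and analyzed it quantitatively. Aiming to collect more general data, we compiled a database of about 71,000 Japanese family names. From this database we derived statistics on characters and character adjacency, such as the kinds and frequencies of characters that appear in family names and the kinds of characters that appear at each character position. These statistics show that, in Japanese family names, the kinds of characters that appear and their frequencies are strongly skewed by character position. As a further analysis that takes actual usage into account, we queried a commercially available telephone-directory database and measured the usage frequency of family names, together with the entropy of family names, of characters, and of character adjacency. The results show that the entropy of family names is roughly equal to that of English words. The properties obtained here can be used in character recognition of addresses and personal information, both as knowledge for segmenting characters in the family-name portion and for deciding or restricting the set of candidate characters, and are expected to improve recognition accuracy.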
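The entropy and per-position statistics mentioned above can be illustrated with a minimal Python sketch. It assumes Shannon entropy in bits over relative frequencies; the function names and the toy usage counts are hypothetical and are not taken from the paper or its telephone-directory data.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy in bits of the distribution given by a Counter of counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)


def positional_counts(names):
    """Map each character position (0-based) to a Counter of the characters seen there."""
    by_pos = {}
    for name in names:
        for i, ch in enumerate(name):
            by_pos.setdefault(i, Counter())[ch] += 1
    return by_pos


# Hypothetical usage counts (illustrative only, not real directory data).
usage = Counter({"佐藤": 4, "鈴木": 2, "高橋": 1, "田中": 1})

name_entropy = shannon_entropy(usage)      # entropy of the family-name distribution
by_pos = positional_counts(list(usage))    # character distribution per position
first_char_entropy = shannon_entropy(by_pos[0])
```

With these toy counts the name probabilities are 1/2, 1/4, 1/8 and 1/8, so `name_entropy` comes out at 1.75 bits. Comparing the entropy at each position is one simple way to see the positional skew the abstract describes.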

In this paper, a database of Japanese family names is constructed, and several characteristics of Japanese family names are extracted from it for use in character recognition. The database contains 71,452 distinct family names. For example, family names are one to six characters long, and 80% of them consist of two characters. Together, the names use 3,796 distinct characters (character categories), of which 1,400 appear more than 10 times. The 1,000 most frequent characters account for 92% of all character occurrences in the names, and 84% of all family names are composed entirely of these 1,000 characters. Furthermore, by querying a Japanese telephone directory database, characteristics of family names that take usage frequency into account are extracted in the same way. This quantitative analysis suggests that these characteristics of Japanese family names can be used to compensate for the limited precision of pattern recognition algorithms.
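The two coverage figures above (the share of character occurrences covered by the top-k characters, and the share of names written entirely with them) can be computed as in the following minimal Python sketch. It assumes each distinct family name is counted once, as in a list of name types; the function name `coverage_stats` and the toy name list are hypothetical, not the paper's data.

```python
from collections import Counter


def coverage_stats(names, k):
    """Return (character coverage, name coverage) for the k most frequent characters.

    Character coverage: fraction of all character occurrences that belong
    to the top-k characters. Name coverage: fraction of names in which
    every character is one of the top-k characters.
    """
    char_counts = Counter(ch for name in names for ch in name)
    top = {ch for ch, _ in char_counts.most_common(k)}
    total = sum(char_counts.values())
    char_cov = sum(char_counts[ch] for ch in top) / total
    name_cov = sum(all(ch in top for ch in name) for name in names) / len(names)
    return char_cov, name_cov


# Hypothetical toy list of distinct family names (illustrative only).
names = ["田中", "中田", "山田", "鈴木"]
char_cov, name_cov = coverage_stats(names, k=2)
```

In this toy list the two most frequent characters are 田 and 中, which cover 5 of the 8 character occurrences (0.625) and fully spell 2 of the 4 names (0.5). On the paper's database the corresponding values for k = 1000 are reported as 92% and 84%.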
