Word Segmentations In Medical Document Using Mutual Information and N-gram

  • Uesugi M
    Department of Medical Informatics, Graduate School of Medicine, Hokkaido University

Bibliographic Information

Other Title
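  • N-gramと相互情報量を用いた医療用語抽出のための分割点の探索 (Searching for Segmentation Points for Medical Term Extraction Using N-grams and Mutual Information)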

Description

We explored word segmentation using statistical information from N-grams, without dictionaries. This technique should be useful for medical term extraction as a preprocessing step in the analysis of medical documents. Once medical terms can be extracted from medical documents with this technique, we believe that relationships among the medical words in the documents can be constructed and that concepts such as an ontology can be built easily.

Mutual Information (MI) was used to decide word segmentation points, based on six kinds of MI values computed from four N-grams. The four N-grams (unigram, bigram, trigram and quadrigram) were calculated from 9,800 summaries (3.2M characters) in Igaku-Chuou Magazine. Each MI value was calculated using the equation log(p(x, y)/(p(x)p(y))), where p(x), p(y) and p(x, y) are each given by an N-gram. For example, the MI value Iuub(x, y) is calculated using p(x) = unigram of x, p(y) = unigram of y and p(x, y) = bigram of x+y. The sum of the six MI values (Iuub+Iubt+Iutq+Ibbq+Ibut+Ituq) and the changes in these values were used to segment words. When the sum of the MI values was at most a threshold γ and the sum of their changes was at most a threshold δ, we judged that a segmentation point existed between x and y. Both thresholds were set from the rate of correct segmentations among all segmentations and from the maximum difference between the percentage of correct segmentations and the percentage of all segmentations. As a result, our method achieved 63.4% accuracy with the thresholds γ=4 and δ=0.
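
The MI computation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several points are assumptions made only for illustration: all function names are hypothetical; p(.) is estimated as the relative frequency of an N-gram among all N-grams of the same length; the natural logarithm is used; a term is skipped when any of its N-grams is unseen; and the six MI names are read as (length of x, length of y, length of x+y), so that Iubt, for example, pairs a unigram x with a bigram y. The δ criterion on the changes of the MI values is left out because the description does not define how those changes are computed.

```python
import math
from collections import Counter


def ngram_counts(text, max_n=4):
    """Count all character N-grams of length 1..max_n in text."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[n][text[i:i + n]] += 1
    return counts


def prob(counts, gram):
    """Relative frequency of gram among N-grams of the same length (assumed estimator)."""
    table = counts[len(gram)]
    total = sum(table.values())
    return table[gram] / total if total else 0.0


def mi(counts, x, y):
    """log(p(x, y) / (p(x) * p(y))); None when any N-gram is unseen (assumed handling)."""
    px, py, pxy = prob(counts, x), prob(counts, y), prob(counts, x + y)
    if px == 0 or py == 0 or pxy == 0:
        return None
    return math.log(pxy / (px * py))


# (len(x), len(y)) for Iuub, Iubt, Iutq, Ibbq, Ibut, Ituq (assumed reading of the names)
CONTEXTS = [(1, 1), (1, 2), (1, 3), (2, 2), (2, 1), (3, 1)]


def mi_sum(counts, text, pos):
    """Sum the available MI values around the boundary just before text[pos]."""
    total = 0.0
    for left, right in CONTEXTS:
        if pos - left < 0 or pos + right > len(text):
            continue  # not enough context on one side of the boundary
        value = mi(counts, text[pos - left:pos], text[pos:pos + right])
        if value is not None:
            total += value
    return total


def candidate_boundaries(counts, text, gamma):
    """Positions whose MI sum is at most gamma (the delta test is omitted)."""
    return [pos for pos in range(1, len(text))
            if mi_sum(counts, text, pos) <= gamma]
```

With counts built from a training corpus by `ngram_counts`, calling `candidate_boundaries(counts, sentence, gamma=4)` returns the character offsets at which this sketch would place a segmentation point. Because the estimator and the log base are assumptions, γ=4 here is not directly comparable to the threshold reported above.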
