A Case-Based Semi-automatic Transformation from HTML Documents to XML Ones: Using the Similarity between HTML Documents Constituting a Series

Bibliographic information

Alternative titles
  • A Case-Based Semi-automatic Transformation from HTML Documents to XML Ones — Using the Similarity between HTML Documents Constituting a Series —
  • 事例に基づくHTML文書からXML文書への半自動変換--シリーズ型HTML文書における類似性の利用
  • ジレイ ニ モトヅク HTML ブンショ カラ XML ブンショ エ ノ ハンジドウ ヘンカン シリーズガタ HTML ブンショ ニ オケル ルイジセイ ノ リヨウ

Abstract

Machine processing of HTML documents has become tremendously important for making use of the large quantity of information on the Internet. HTML, however, is designed mainly for reading in browsers and is therefore not well suited to machine processing. XML was proposed as a solution to this problem. Unfortunately, fully automatic transformation from HTML to XML is extremely difficult, because it requires understanding the meaning of the HTML documents. On the other hand, actual Web sites contain many series of HTML pages, and the pages of a series usually have quite similar structures. A case-based transformation is therefore a promising method in practice. In this paper, we present a case-based method for transforming HTML documents into XML documents. Given a series of HTML documents and a sample transformation of one selected HTML document into an XML document, we first analyze both the semantic and the syntactic information appearing in the sample pair. The remaining HTML pages of the series are then automatically transformed into XML documents using the information extracted from the sample. We adopt a vector model of weighted term frequency to approximate the meaning of HTML documents, and use both headlines and a parse tree as syntactic information. Through experimental evaluation, we show that this case-based method achieves highly accurate transformation: 80% of 80 actual pages were transformed correctly.
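
The semantic step named in the abstract, approximating meaning with a vector model of weighted term frequency, can be illustrated with a minimal sketch. The abstract does not state the weighting scheme or the similarity measure, so the smoothed TF-IDF weights and cosine similarity below are assumptions, and the names `weighted_vectors`, `cosine`, `assign_tags`, and the tags `price` and `author` are hypothetical. The sketch labels each text block of a new page with the XML tag of the most similar block in the sample page; it does not cover the headline and parse-tree analysis.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weighted_vectors(texts):
    """Build one sparse vector (term -> weight) per text.

    Assumption: weight = term frequency * smoothed inverse document
    frequency. The paper's actual weighting is not given in the abstract.
    """
    counts = [Counter(tokenize(t)) for t in texts]
    n = len(texts)
    doc_freq = Counter(term for c in counts for term in c)
    # Smoothing keeps every weight positive, even for terms in all texts.
    idf = {term: math.log((1 + n) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [{term: tf * idf[term] for term, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_tags(sample_blocks, new_blocks):
    """Label each new block with the tag of the most similar sample block.

    sample_blocks: dict mapping a (hypothetical) XML tag to the text that
                   the sample transformation placed under that tag.
    new_blocks:    list of text blocks from another page of the series.
    """
    tags = list(sample_blocks)
    texts = [sample_blocks[t] for t in tags] + list(new_blocks)
    vectors = weighted_vectors(texts)
    sample_vecs, new_vecs = vectors[:len(tags)], vectors[len(tags):]
    result = []
    for vec in new_vecs:
        scores = [cosine(vec, s) for s in sample_vecs]
        result.append(tags[scores.index(max(scores))])
    return result


if __name__ == "__main__":
    sample = {
        "price": "price 1200 yen tax included",
        "author": "written by author Tanaka Taro",
    }
    print(assign_tags(sample, ["author Suzuki Hanako", "price 980 yen"]))
```

Because pages in a series share vocabulary as well as layout, even this simple matching separates blocks that reuse field words such as "price" or "author"; a practical system would combine it with the syntactic cues the abstract mentions.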
