Bibliographic Information
- Alternative titles
- A Case-Based Semi-automatic Transformation from HTML Documents to XML Ones — Using the Similarity between HTML Documents Constituting a Series —
- 事例に基づくHTML文書からXML文書への半自動変換--シリーズ型HTML文書における類似性の利用
- ジレイ ニ モトヅク HTML ブンショ カラ XML ブンショ エ ノ ハンジドウ ヘンカン シリーズガタ HTML ブンショ ニ オケル ルイジセイ ノ リヨウ
Abstract
Machine processing of HTML documents has become increasingly important for making use of the large quantity of information on the Internet. HTML, however, is designed mainly for reading in browsers and is therefore not well suited to machine processing. XML was proposed as a solution to this problem. Unfortunately, fully automatic transformation from HTML to XML is extremely difficult, because it requires understanding the meaning of HTML documents. On the other hand, actual Web sites contain many series of HTML pages, and the pages of a series usually have quite similar structures. A case-based transformation is therefore a promising method in practice. In this paper, we give a case-based method for transforming HTML documents into XML documents. Given a series of HTML documents and a sample transformation of one selected HTML document into an XML document, we first analyze both the semantic and the syntactic information appearing in the sample pair. Next, the remaining HTML pages of the series are automatically transformed into XML documents by using the information extracted from the sample. We adopt a vector model of weighted term frequency to approximate the meaning of HTML documents, and use both headlines and a parse tree as syntactic information. Through experimental evaluation, we show that this case-based method achieves a highly accurate transformation: 80% of 80 actual pages were transformed correctly.
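The abstract's vector model of weighted term frequency can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tokenizer, the smoothed IDF weighting, the cosine similarity measure, and the names `tokenize`, `weighted_vectors`, `cosine`, and `best_label` are all assumptions made for illustration. The sketch compares a text block from a new page with the blocks of a sample page that have already been mapped to XML tags, and returns the tag of the most similar block.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens (an assumed stand-in for real term extraction)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weighted_vectors(docs):
    """Return one {term: tf * idf} vector per document, using a smoothed idf."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for c in counts for term in c)
    return [
        {t: f * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, f in c.items()}
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_label(sample, fragment):
    """Pick the XML tag whose sample text is most similar to `fragment`.

    `sample` maps a hypothetical XML tag name to the text of the HTML block
    that was assigned to that tag in the sample transformation.
    """
    tags = list(sample)
    vectors = weighted_vectors([sample[t] for t in tags] + [fragment])
    target = vectors[-1]
    scores = {t: cosine(v, target) for t, v in zip(tags, vectors)}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    sample = {
        "schedule": "lecture schedule week topic room",
        "staff": "professor lecturer office email contact",
    }
    print(best_label(sample, "contact the lecturer by email"))
```

The method described in the abstract also uses headlines and a parse tree as syntactic information; this sketch covers only the term-vector part.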
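Journal
- Transactions of the Japanese Society for Artificial Intelligence (人工知能学会論文誌)
Transactions of the Japanese Society for Artificial Intelligence 16 (5), 408-416, 2001
The Japanese Society for Artificial Intelligence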
Details
- CRID: 1390282680083487872
- NII Article ID: 10015770261
- NII Book ID: AA11579226
- ISSN: 13468030, 13460714
- NDL BIB ID: 5987649
- Text language code: ja
- Data source types: JaLC, NDL, Crossref, CiNii Articles
- Abstract license flag: Disallowed