Extracting Text Data from HTML Documents

Yoshitsugu Murakami (村上 義継)

Bibliographic Information

Other Title

HTMLからのテキストの自動切り出しアルゴリズムと実装 (An algorithm for automatically extracting text from HTML, and its implementation)
HTML カラノテキストノジドウキリダシアルゴリズムトジッソウ

Description

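A program that extracts partial data from HTML text collected on the World Wide Web is called an HTML Wrapper. This study proposes a new data model for HTML Wrappers and constructs a machine learning algorithm that automatically generates an HTML Wrapper for extracting the desired text data from a given HTML document. The algorithm is then implemented in Java, and its effectiveness is verified.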

This paper introduces a new model of the HTML Wrapper for information extraction from HTML documents and presents a learning algorithm for HTML Wrappers in the framework of learning from examples. The expressiveness of this model is shown by experimental results.

Journal

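IPSJ Transactions on Mathematical Modeling and Its Applications (TOM)

IPSJ Transactions on Mathematical Modeling and Its Applications (TOM) 42 (SIG14(TOM5)), 39-49, 2001-12-15

Information Processing Society of Japan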

Citations (3)

References (17)

Keywords

Original paper

Details

CRID
- 1050282812868508544

NII Article ID
- 10012520218
- 110002936511
- 110002726143

NII Book ID
- AA11464803

ISSN
- 18827780
- 09196072
- 03875806

NDL BIB ID
- 5747753
- 6022913

Web Site
- https://ipsj.ixsq.nii.ac.jp/records/17308
- http://id.ndl.go.jp/bib/5747753
- https://ndlsearch.ndl.go.jp/books/R000000004-I5747753
- http://id.ndl.go.jp/bib/6022913
- https://ndlsearch.ndl.go.jp/books/R000000004-I6022913

Text Lang
- ja

Article Type
- journal article

Data Source
- IRDB
- NDL Search
- CiNii Articles
