Metadata Extraction by Sentence Element Analysis and Named Entity Extraction

齋藤 悠, SAITO Yu

doi:10.15002/00009570

Many documents on the web lack metadata describing their semantic content, and adding it manually is impractical because of the cost; a method for extracting metadata automatically is therefore needed. This paper proposes an approach that extracts 4W+HM metadata, namely <when>, <where>, <who>, <what>, and <hm> (how much and how many), from plain text, and evaluates its retrieval effectiveness. The extraction process does not depend on dictionaries customized for specific fields, so it can be applied to general documents. It consists of three main parts. First, sentence element analysis identifies the role that each chunk plays in a sentence. Second, named entity extraction finds notable expressions in the sentence. Third, a criterion-based method outputs metadata based on the information from the two preceding parts. The extraction is evaluated by calculating recall and precision on a test set manually annotated with correct metadata. Mextractr, a related system that extracts 5W1H metadata, is evaluated on the same test set, and its results are compared with those of the proposed method. The experimental results show that the proposed method is superior to Mextractr in most respects. In particular, it excels at distinguishing <who> from <where>, which suggests that sentence element analysis helps avoid confusing actors with places.
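
The recall and precision evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's evaluation code: the function name `precision_recall`, the representation of metadata as `(tag, value)` pairs, the exact string matching rule, and the sample data are all assumptions made for illustration, and the paper may define a correct match differently.

```python
from collections import Counter

# The five 4W+HM tags named in the abstract.
TAGS = ("when", "where", "who", "what", "hm")


def precision_recall(extracted, gold):
    """Return {tag: (precision, recall)} for lists of (tag, value) pairs.

    Assumption: an extracted value is correct only if it exactly matches
    a manually annotated value carrying the same tag.
    """
    scores = {}
    for tag in TAGS:
        ext = Counter(value for t, value in extracted if t == tag)
        ref = Counter(value for t, value in gold if t == tag)
        # Multiset intersection counts each annotated value at most once.
        correct = sum((ext & ref).values())
        n_ext = sum(ext.values())
        n_ref = sum(ref.values())
        precision = correct / n_ext if n_ext else 0.0
        recall = correct / n_ref if n_ref else 0.0
        scores[tag] = (precision, recall)
    return scores


# Hypothetical data: the extractor mislabels the place "Tokyo" as <who>.
gold = [("who", "the mayor"), ("where", "Tokyo"),
        ("when", "Monday"), ("hm", "three")]
extracted = [("who", "the mayor"), ("who", "Tokyo"), ("when", "Monday")]

scores = precision_recall(extracted, gold)
for tag in TAGS:
    p, r = scores[tag]
    print(f"<{tag}> precision={p:.2f} recall={r:.2f}")
```

In this made-up example, confusing a place with an actor lowers the precision of <who> and the recall of <where> at the same time, which is the kind of error the abstract says sentence element analysis helps avoid.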
