Optimization of Multi-level Tokenization for Improving Accuracy of Downstream Tasks

Bibliographic Information

Other Title
  • 後段タスクの精度向上のためのマルチレベルな分かち書きの最適化

Abstract

Tokenization is known to affect the accuracy of downstream tasks. Hiraoka et al. proposed optok4at, a method that optimizes tokenization to improve downstream-task accuracy. However, because optok4at uses only one type of tokenizer and builds its vocabulary by unsupervised learning, the tokenizer may miss infrequent but important phrases, which can reduce accuracy. In this paper, we propose an optimization method that uses multiple tokenizers. The proposed method concatenates the outputs of two tokenizers with different vocabularies and feeds the result to the downstream model. By combining an unsupervised tokenizer with a dictionary-based tokenizer whose vocabulary contains frequent phrases, we attempt to improve the accuracy of downstream tasks. In experiments on several text classification tasks, we found that the proposed method did not improve accuracy, even though it did tokenize phrases.
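
The concatenation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation and it does not include the optok4at optimization itself: the two toy tokenizers, the vocabularies, and every function name below are hypothetical stand-ins. The greedy longest-match subword splitter stands in for an unsupervised tokenizer, and the phrase matcher stands in for a dictionary-based one.

```python
from typing import List, Set

# Hypothetical vocabularies, chosen only for illustration.
SUBWORD_VOCAB: Set[str] = {"mach", "ine", "learn", "ing", "work", "s"}
PHRASE_VOCAB: Set[str] = {"machine learning"}


def subword_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Stand-in for an unsupervised tokenizer: greedy longest match per word."""
    tokens: List[str] = []
    for word in text.split():
        i = 0
        while i < len(word):
            # Try the longest span first; fall back to a single character.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab or j == i + 1:
                    tokens.append(word[i:j])
                    i = j
                    break
    return tokens


def phrase_tokenize(text: str, phrases: Set[str], max_words: int = 3) -> List[str]:
    """Stand-in for a dictionary-based tokenizer: merge words that form a known phrase."""
    words = text.split()
    tokens: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest phrase first; a single word is always accepted.
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if n == 1 or candidate in phrases:
                tokens.append(candidate)
                i += n
                break
    return tokens


def multi_level_tokens(text: str) -> List[str]:
    """Concatenate both tokenizations into one input sequence for a downstream model."""
    return subword_tokenize(text, SUBWORD_VOCAB) + phrase_tokenize(text, PHRASE_VOCAB)


if __name__ == "__main__":
    print(multi_level_tokens("machine learning works"))
```

In this sketch the downstream model would see both the fine-grained subword tokens and the phrase-level token "machine learning" in a single sequence. How the paper actually embeds, orders, or separates the two outputs is not stated in the abstract, so the plain list concatenation here is an assumption.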
