Distance Measures and Stemming Impact on Arabic Document Clustering

Description

Clustering of Arabic documents is considered a vital aspect of obtaining optimal results from unsupervised learning. Its aim is to automatically group similar documents into a single cluster using a similarity or distance measure. However, many similarity and distance measures are available, and their effectiveness in document clustering when stemming takes syntactic structure into account remains unclear. Therefore, this study evaluates the impact of five similarity/distance measures (cosine similarity, the Jaccard coefficient, Pearson's correlation coefficient, Euclidean distance, and averaged Kullback-Leibler divergence) combined with two stemming algorithms (morphology- and syntax-based lemmatization, and morphology-based Information Science Research Institute (ISRI) stemming) on the clustering of an Arabic text dataset. We aim to identify the best-performing similarity and distance measures and to determine which measure is most suitable for Arabic document clustering. In our experiments, the method based on syntactic structure and morphology outperformed the other stemming methods under each of the five similarity/distance measures for Arabic document clustering. The best-performing similarity measure was cosine similarity, and the best-performing distance measure was Euclidean distance.
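As an illustration of the five measures named above, the following is a minimal Python sketch that computes each one on two term-weight vectors. It is not the study's implementation: the function names, the toy vectors, the use of the extended (Tanimoto) form of the Jaccard coefficient, and the particular formulation of averaged Kullback-Leibler divergence (a per-term weighted average of the divergences from each vector to their weighted mean) are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard_coefficient(a, b):
    """Extended (Tanimoto) Jaccard coefficient for real-valued vectors (assumed form)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0


def pearson_correlation(a, b):
    """Pearson's correlation coefficient between two vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b) if var_a and var_b else 0.0


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _kl_term(x, y):
    """Single-term KL contribution x * log(x / y); treated as 0 when x or y is 0."""
    return x * math.log(x / y) if x > 0 and y > 0 else 0.0


def averaged_kl_divergence(p, q):
    """Symmetric, averaged KL divergence (one common formulation; an assumption here)."""
    total = 0.0
    for x, y in zip(p, q):
        s = x + y
        if s == 0:
            continue  # term absent from both documents
        pi1, pi2 = x / s, y / s
        w = pi1 * x + pi2 * y  # weighted mean of the two term weights
        total += pi1 * _kl_term(x, w) + pi2 * _kl_term(y, w)
    return total


# Hypothetical term-frequency vectors for two short documents.
doc_a = [2, 1, 0, 1]
doc_b = [1, 1, 1, 0]

for measure in (cosine_similarity, jaccard_coefficient, pearson_correlation,
                euclidean_distance, averaged_kl_divergence):
    print(measure.__name__, round(measure(doc_a, doc_b), 4))
```

Note that cosine similarity, the Jaccard coefficient, and Pearson's correlation are similarities (larger means more alike), whereas Euclidean distance and averaged Kullback-Leibler divergence are dissimilarities (smaller means more alike), so a clustering routine has to treat the two groups accordingly.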
