Distance Measures and Stemming Impact on Arabic Document Clustering
Description
Clustering Arabic documents is considered a vital aspect of obtaining optimal results from unsupervised learning. Its aim is to automatically group similar documents into the same cluster using a similarity or distance measure. However, many similarity and distance measures are available, and their effectiveness in document clustering when combined with syntax-aware stemming is still unclear. This study therefore evaluates the impact of five similarity/distance measures (cosine similarity, the Jaccard coefficient, Pearson’s correlation coefficient, Euclidean distance, and averaged Kullback-Leibler divergence) combined with two stemming algorithms (morphology- and syntax-based lemmatization, and morphology-based Information Science Research Institute (ISRI) stemming) on clustering an Arabic text dataset. We aim to identify the best-performing similarity and distance measures and to determine which measure is most suitable for Arabic document clustering. In our experiments, the method based on syntactic structure and morphology outperformed the other stemming methods with any of the five similarity/distance measures for Arabic document clustering. The best-performing similarity measure is cosine similarity, and the best-performing distance measure is Euclidean distance.
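To make the five measures concrete, the following is a minimal sketch of how each could be computed for two documents represented as term-weight vectors. It is an illustration under stated assumptions, not the study's implementation: the function names and example vectors are hypothetical, the Jaccard coefficient is taken in its extended (real-valued) form, and the averaged Kullback-Leibler divergence is approximated as the mean of the two directed divergences after smoothing, which may differ from the exact definition used in the study.

```python
import math


def dot(a, b):
    """Inner product of two equal-length term-weight vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    # Cosine of the angle between the two vectors.
    return dot(a, b) / (norm(a) * norm(b))


def jaccard_coefficient(a, b):
    # Extended Jaccard form for real-valued vectors (an assumption here).
    ab = dot(a, b)
    return ab / (dot(a, a) + dot(b, b) - ab)


def pearson_correlation(a, b):
    # Cosine similarity of the mean-centered vectors.
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    ca = [x - mean_a for x in a]
    cb = [y - mean_b for y in b]
    return dot(ca, cb) / (norm(ca) * norm(cb))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _to_distribution(a, eps):
    # Add a small constant so zero weights do not break the logarithm,
    # then normalize so the weights sum to 1.
    smoothed = [x + eps for x in a]
    total = sum(smoothed)
    return [x / total for x in smoothed]


def averaged_kl_divergence(a, b, eps=1e-12):
    # Assumption: "averaged" is taken as the mean of KL(p||q) and KL(q||p).
    p = _to_distribution(a, eps)
    q = _to_distribution(b, eps)
    kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


if __name__ == "__main__":
    # Hypothetical term-weight vectors for two stemmed documents.
    doc_a = [2.0, 1.0, 0.0, 1.0]
    doc_b = [1.0, 1.0, 1.0, 0.0]
    print("cosine    :", cosine_similarity(doc_a, doc_b))
    print("jaccard   :", jaccard_coefficient(doc_a, doc_b))
    print("pearson   :", pearson_correlation(doc_a, doc_b))
    print("euclidean :", euclidean_distance(doc_a, doc_b))
    print("avg KL    :", averaged_kl_divergence(doc_a, doc_b))
```

Note that the first three functions are similarities (larger means more alike), while the last two are distances (smaller means more alike), so a clustering routine would need to convert between the two conventions before comparing them.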