Implementation and performance evaluation of dynamic scheduling for parallel decision tree generation
説明
This paper proposes a parallel data-mining algorithm and its implementation on a PC cluster. The decision tree is a widely used data-mining algorithm for classifying records in a database. Simple parallelization of decision tree generation is not efficient because of the load imbalance caused by the form of the generated tree. The SPRINT algorithm solves this problem by grouping a set of nodes in the same level of the tree and balancing the load; however, frequent disk access is required when the data size exceeds the memory size. We propose an improved parallel algorithm of SPRINT by incorporating a dynamic scheduling. Dynamic scheduling is effective in reducing the amount of disk access for storing intermediate results; however, it may cause imbalance in data distribution on PEs (Processing Elements). We solved this problem by incorporating data redistribution. The evaluation result shows that our method realizes an improvement in speed of 3.5 times, for the best case, and equal performance even in the worst case, compared with SPRINT. We also discuss how further performance enhancement may be possible by improving the communication performance.
収録刊行物
-
- Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001
-
Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001 1579-1588, 2005-08-29
IEEE Comput. Soc