Examination of Parallelization by CUDA in SPEC benchmark program

平 勇樹, 木村 啓二, 笠原 博徳

In recent years, GPGPU, the use of GPUs for general-purpose scientific computation, has attracted attention. GPUs offer higher arithmetic throughput than CPUs, but exploiting their high parallelism requires selecting programs that are themselves parallel and optimizing them. Simple Doall computations readily achieve good performance, whereas computations with lower parallelism, such as reductions, cannot make full use of the GPU unless they are optimized. This report takes 331.art from SPEC OMPL2001, a highly parallel benchmark for general-purpose SMP systems, as the evaluation target. It investigates how much GPU performance optimization can deliver and how the speedup varies with data size. Optimizing 331.art with CUDA, a parallel architecture for GPGPU, yielded a 1.90x speedup over the total computation time of parallel execution on 12 cores. We also confirmed that a reduction over an array of 2 million or more elements can use the data transfer bandwidth effectively and runs faster than parallel execution on the CPU.
