Applying the TCA Architecture to QUDA, a QCD Library for GPUs

Norihisa Fujita, Hisafumi Fujii, Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, Yoshinobu Kuramashi, Mike Clark

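QUDA is an open-source Lattice Quantum Chromo-Dynamics (QCD) framework that can use NVIDIA Graphics Processing Unit (GPU) accelerators. In this paper, we apply to QUDA the Tightly Coupled Accelerators (TCA) architecture, an interconnect network that we have been developing since our earlier work and that enables direct communication between GPUs. We are developing the TCA architecture to provide low-latency connections between accelerators attached via PCI Express (PCIe) on different nodes of a PC cluster. TCA enables communication between accelerators such as GPUs across nodes in the same way as intra-node PCIe communication. We evaluate performance on the Highly Accelerated Parallel Advanced system for Computational Science with the TCA feature (HA-PACS/TCA), a high-density GPU cluster that serves as the proof-of-concept testbed for the TCA architecture. Applying TCA to QUDA reduced communication latency and achieved better strong scaling than InfiniBand, which is commonly used in accelerator-equipped PC clusters. In terms of performance per iteration of the Conjugate Gradient (CG) method, the TCA implementation was 2.19 times faster than the MPI peer-to-peer implementation and 1.23 times faster than the MPI-3 Remote Memory Access (RMA) implementation. Both MPI implementations use an InfiniBand QDR network.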

QUDA is a Lattice QCD library that can use NVIDIA's Graphics Processing Unit (GPU) accelerators and is widely used as a framework for Lattice QCD applications. In this paper, we apply our novel proprietary interconnect network, the Tightly Coupled Accelerators (TCA) architecture, to inter-node GPU communication in QUDA. The TCA architecture was developed for low-latency inter-node communication among accelerators connected through the PCI Express (PCIe) bus on PC clusters. It enables direct memory copy between accelerators, such as GPUs, across nodes in the same manner as an intra-node PCIe transaction. We assess the performance of TCA on QUDA using HA-PACS/TCA, a high-density GPU cluster that serves as a proof-of-concept testbed for the TCA architecture. The results show that TCA significantly reduces communication latency and achieves better strong scaling than the InfiniBand solutions commonly used on PC clusters with GPUs. Measured by execution time per Conjugate Gradient (CG) iteration, the TCA implementation is 2.19 times faster than the peer-to-peer MPI implementation and 1.23 times faster than the MPI-3 remote memory access (RMA) implementation, where an InfiniBand QDR network is used in both MPI cases.
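To make the "per CG iteration" metric concrete, the following is a minimal, single-process sketch of the textbook Conjugate Gradient method in plain Python. It is not QUDA code and not the authors' implementation; the function names `matvec`, `dot`, and `cg` are hypothetical and chosen only for illustration. The comments mark where a distributed solver would typically need inter-node communication, which is the part that a lower-latency interconnect could be expected to shorten.

```python
def matvec(A, x):
    """Dense matrix-vector product. In a distributed lattice solver this
    step would typically require exchanging boundary data with neighbours."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def dot(u, v):
    """Inner product. In a distributed solver this would typically be a
    global reduction across all processes."""
    return sum(a * b for a, b in zip(u, v))


def cg(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive-definite A.

    Returns the approximate solution and the number of iterations run.
    """
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, with x = 0 initially
    p = list(r)          # search direction
    rr = dot(r, r)
    for it in range(max_iter):
        if rr ** 0.5 < tol:
            return x, it
        # One CG iteration: one matrix-vector product and two inner products.
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter


if __name__ == "__main__":
    A = [[4.0, 1.0], [1.0, 3.0]]
    b = [1.0, 2.0]
    solution, iterations = cg(A, b)
    print(solution, iterations)
```

Each pass through the loop costs one matrix-vector product plus a few vector updates and reductions, so when the local problem per GPU is small, as in strong scaling, the communication inside those steps tends to dominate the iteration time.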
