Towards an Asynchronous Checkpointing System
-
- Kento Sato
- Tokyo Institute of Technology
-
- Adam Moody
- Lawrence Livermore National Laboratory
-
- Kathryn Mohror
- Lawrence Livermore National Laboratory
-
- Todd Gamblin
- Lawrence Livermore National Laboratory
-
- Bronis R. de Supinski
- Lawrence Livermore National Laboratory
-
- Naoya Maruyama
- Tokyo Institute of Technology
-
- Satoshi Matsuoka
- Tokyo Institute of Technology / National Institute of Informatics
Description
The overall failure rate of HPC systems is increasing because the number of components is growing. Checkpoint/Restart, the most common technique for tolerating these faults, lets an application restart from its last checkpoint if a failure occurs while it is running. However, writing large checkpoint files can increase application runtime, depending on the bandwidth of the file system to which the checkpoints are written. To minimize this impact, we propose an asynchronous checkpointing system that writes checkpoints to the file system in the background. The system uses extra nodes to drain checkpoints from the compute nodes using RDMA (Remote Direct Memory Access) to minimize CPU usage. Our preliminary evaluation shows that the system keeps the runtime increase of CPU-bound applications under 1% compared with runs that do not checkpoint to a parallel file system.
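The background-write idea in the description can be illustrated with a minimal single-process sketch. It is not the authors' system: a Python thread and a local directory stand in for the extra drain nodes, the RDMA transfer, and the parallel file system, and the names `AsyncCheckpointer` and `run_demo` are hypothetical. The only point shown is that the application hands off a copy of its state and continues while the write happens elsewhere.

```python
import os
import queue
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical helper: a background thread drains checkpoints to disk."""

    def __init__(self, directory):
        self.directory = directory
        self._pending = queue.Queue()
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def checkpoint(self, step, state):
        # Snapshot the state so the application can keep mutating its buffer.
        # (Assumption: a memory copy stands in for the RDMA read by a drain node.)
        self._pending.put((step, bytes(state)))

    def _drain(self):
        while True:
            item = self._pending.get()
            if item is None:  # sentinel: no more checkpoints will arrive
                return
            step, data = item
            path = os.path.join(self.directory, f"ckpt_{step:06d}.bin")
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())
            # Rename last so a reader never sees a partially written checkpoint.
            os.replace(tmp, path)

    def close(self):
        # Queue the sentinel behind all pending checkpoints, then wait.
        self._pending.put(None)
        self._writer.join()


def run_demo(directory, steps=3):
    """Run a toy compute loop that checkpoints every step without blocking on I/O."""
    ckpt = AsyncCheckpointer(directory)
    state = bytearray(8)
    for step in range(steps):
        state[0] = step  # stand-in for real computation
        ckpt.checkpoint(step, state)
    ckpt.close()
    return sorted(os.listdir(directory))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(run_demo(d))
```

In this sketch the application still pays for the in-memory copy in `checkpoint`; the description's use of RDMA from separate nodes is what would move that cost off the compute CPU.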
Journal
-
- IPSJ SIG Technical Report. [High Performance Computing]
-
IPSJ SIG Technical Report. [High Performance Computing] 2011 (18), 1-8, 2011-11-21
Information Processing Society of Japan
Details
-
- CRID
- 1574231876788646016
-
- NII Article ID
- 110008690460
-
- NII Book ID
- AN10463942
-
- Text language code
- en
-
- Data source type
-
- CiNii Articles