Towards an Asynchronous Checkpointing System

Kento Sato, Adam Moody, Kathryn Mohror, Todd Gamblin, Bronis R. de Supinski, Naoya Maruyama, Satoshi Matsuoka

Kento Sato, Tokyo Institute of Technology
Adam Moody, Lawrence Livermore National Laboratory
Kathryn Mohror, Lawrence Livermore National Laboratory
Todd Gamblin, Lawrence Livermore National Laboratory
Bronis R. de Supinski, Lawrence Livermore National Laboratory
Naoya Maruyama, Tokyo Institute of Technology
Satoshi Matsuoka, Tokyo Institute of Technology; National Institute of Informatics

Description

The overall failure rate of HPC systems is increasing as the number of components grows. Checkpoint/Restart, the most common technique for tolerating these faults, lets an application restart from its last checkpoint when a failure occurs during execution. However, writing large checkpoint files can lengthen application runtime, depending on the bandwidth of the file system to which the checkpoints are written. To minimize this impact, we propose an asynchronous checkpointing system that writes checkpoints to the file system in the background. The system uses extra nodes to drain checkpoints from compute nodes via RDMA (Remote Direct Memory Access), which minimizes CPU usage on the compute nodes. Our preliminary evaluation shows that, when checkpointing to a parallel file system, the asynchronous checkpointing system keeps the runtime increase of CPU-bound applications under 1% compared to not checkpointing.
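The background-drain idea in the description can be illustrated with a minimal single-process sketch. It is not the authors' system: a background thread stands in for the extra drain nodes, an in-memory copy plus an ordinary file write stands in for the RDMA transfer, and the names `AsyncCheckpointer` and `run_demo` are hypothetical.

```python
import os
import queue
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical sketch: a background thread plays the role of a drain node."""

    def __init__(self, directory):
        self.directory = directory
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def checkpoint(self, step, state):
        # Take an in-memory snapshot and return immediately, so the
        # application is not blocked on file system bandwidth.
        self._pending.put((step, bytes(state)))

    def _drain(self):
        # Write queued snapshots to the file system in the background.
        while True:
            item = self._pending.get()
            if item is None:  # sentinel: no more checkpoints
                break
            step, data = item
            path = os.path.join(self.directory, f"ckpt_{step}.bin")
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                f.write(data)
            # Rename only after the write completes, so a restart never
            # sees a partially written checkpoint under the final name.
            os.replace(tmp, path)

    def close(self):
        # Flush: the sentinel is queued last, so all earlier snapshots
        # are written before the worker exits.
        self._pending.put(None)
        self._worker.join()


def run_demo(directory, steps=3):
    ckpt = AsyncCheckpointer(directory)
    state = bytearray(8)
    for step in range(steps):
        state[0] = step  # stand-in for one step of computation
        ckpt.checkpoint(step, state)
    ckpt.close()
    return sorted(os.listdir(directory))


if __name__ == "__main__":
    print(run_demo(tempfile.mkdtemp()))
```

The sketch only shows the decoupling of the application from the slow write. In the system described above, the drain work runs on separate nodes and uses RDMA, so it also avoids consuming compute-node CPU, which a thread in the same process does not.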

Published in

IPSJ SIG Technical Reports. [High Performance Computing] (情報処理学会研究報告. [ハイパフォーマンスコンピューティング]) 2011 (18), 1-8, 2011-11-21

Information Processing Society of Japan

Keywords

File systems and checkpointing

Details

CRID: 1574231876788646016
NII Article ID: 110008690460
NII Book ID: AN10463942
Text language code: en
Data source type: CiNii Articles
