Towards an Asynchronous Checkpointing System

Description

The overall failure rate of HPC systems is increasing because the number of components is growing. Checkpoint/Restart, the most common technique for tolerating these faults, lets an application restart from its last checkpoint when a failure occurs while it is running. However, writing large checkpoint files can lengthen application runtime, depending on the bandwidth of the file system to which the checkpoints are written. To minimize this impact, we propose an asynchronous checkpointing system that writes checkpoints to the file system in the background. The system uses extra nodes to drain checkpoints from the compute nodes using RDMA (Remote Direct Memory Access), which minimizes CPU usage on the compute nodes. Our preliminary evaluation shows that, when checkpointing to a parallel file system, the system increases the runtime of CPU-bound applications by less than 1% compared to running without checkpointing.
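
The idea of writing checkpoints in the background can be illustrated with a minimal single-process sketch. It is only an analogy and not the system described above: a background thread stands in for the extra drain nodes, an in-memory queue stands in for the RDMA transfer, and a local directory stands in for the parallel file system. The names `AsyncCheckpointer` and `restart_from_latest` are hypothetical and chosen for illustration.

```python
import os
import pickle
import queue
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical helper: stages checkpoints in memory, writes them in the background."""

    def __init__(self, directory):
        self.directory = directory
        self._pending = queue.Queue()
        # The drainer thread plays the role of the extra nodes (an assumption
        # of this sketch; no RDMA or remote node is involved here).
        self._drainer = threading.Thread(target=self._drain, daemon=True)
        self._drainer.start()

    def checkpoint(self, step, state):
        # Serialize now so the application can keep mutating `state`;
        # only this in-memory copy is on the application's critical path.
        snapshot = pickle.dumps(state)
        self._pending.put((step, snapshot))

    def _drain(self):
        while True:
            item = self._pending.get()
            if item is None:  # sentinel pushed by close()
                break
            step, snapshot = item
            path = os.path.join(self.directory, f"ckpt_{step:06d}.pkl")
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                f.write(snapshot)
            # Rename only after the write finishes, so a restart never
            # picks up a partially written checkpoint.
            os.replace(tmp, path)

    def close(self):
        # Flush all pending checkpoints, then stop the drainer thread.
        self._pending.put(None)
        self._drainer.join()


def restart_from_latest(directory):
    """Load the newest complete checkpoint, or return None if there is none."""
    names = sorted(n for n in os.listdir(directory) if n.endswith(".pkl"))
    if not names:
        return None
    with open(os.path.join(directory, names[-1]), "rb") as f:
        return pickle.load(f)
```

In use, an application would call `checkpoint(step, state)` inside its main loop and `close()` at the end; after a failure, `restart_from_latest(directory)` returns the most recent fully written state. The sketch does not reproduce the low CPU usage that the proposed system obtains from RDMA, since serialization here still runs on the application's own CPU.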

Details

  • CRID
    1574231876788646016
  • NII Article ID
    110008690460
  • NII Bibliographic ID
    AN10463942
  • Text language code
    en
  • Data source type
    • CiNii Articles
