FPGA‐accelerated deep convolutional neural networks for high throughput and energy efficiency

  • Yuran Qiao
    Department of Computer, State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, Hunan, China
  • Junzhong Shen
    Department of Computer, State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, Hunan, China
  • Tao Xiao
    Department of Computer, State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, Hunan, China
  • Qianming Yang
    Department of Computer, State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, Hunan, China
  • Mei Wen
    Department of Computer, State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, Hunan, China
  • Chunyuan Zhang
    Department of Computer, State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, Hunan, China

Abstract

Summary

Recent breakthroughs in deep convolutional neural networks (CNNs) have led to great improvements in the accuracy of both vision and auditory systems. Characterized by their deep structures and large numbers of parameters, deep CNNs challenge the computational capabilities of today's platforms. Hardware specialization in the form of the field-programmable gate array (FPGA) offers a promising path towards major leaps in computational performance while achieving high energy efficiency.

In this paper, we focus on accelerating deep CNNs using the Xilinx Zynq-zq7045 FPGA SoC. As most of the computational workload can be converted to matrix multiplications, we adopt a matrix-multiplier-based accelerator architecture. Dedicated units are designed to eliminate the conversion overhead. We also design a customized memory system according to the memory access pattern of CNNs. To make the accelerator easily usable by application developers, it supports Caffe, a widely used software framework for deep CNNs. Different CNN models can be adopted by our accelerator with good performance portability. The experimental results show that for a typical CNN application, image classification, an average throughput of 77.8 GFLOPS is achieved, while the energy efficiency is 4.7× better than that of an Nvidia K20 GPGPU. © 2016 The Authors. *Concurrency and Computation: Practice and Experience* Published by John Wiley & Sons Ltd
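The abstract does not detail how convolutions are converted to matrix multiplications, but the standard lowering it alludes to is the im2col transform, which Caffe itself uses: each sliding window of the input feature map is unrolled into a column, so the convolution becomes a single GEMM. A minimal NumPy sketch (illustrative only; all function names are ours, not the paper's):

```python
import numpy as np

def im2col(x, kh, kw):
    """Unroll (C, H, W) input into a (C*kh*kw, out_h*out_w) matrix
    of sliding windows (stride 1, no padding)."""
    C, H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, out_h * out_w))
    row = 0
    for c in range(C):
        for i in range(kh):
            for j in range(kw):
                # Each (i, j) offset within the kernel contributes one row.
                cols[row] = x[c, i:i + out_h, j:j + out_w].reshape(-1)
                row += 1
    return cols

def conv2d_as_gemm(x, w):
    """Convolution via GEMM: reshape (K, C, kh, kw) kernels to a
    (K, C*kh*kw) matrix and multiply by the im2col matrix."""
    K, C, kh, kw = w.shape
    cols = im2col(x, kh, kw)
    out = w.reshape(K, -1) @ cols        # the matrix multiplication
    out_h, out_w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    return out.reshape(K, out_h, out_w)
```

The "conversion overhead" the abstract mentions is exactly this unrolling step, which duplicates input data kh×kw times; the paper's dedicated units perform it in hardware so the matrix multiplier is fed without a software im2col pass.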

Cited by: 1
