A Depthwise Separable Convolution Hardware Accelerator for ShuffleNetV2
Linshuang Li, Dihu Chen, Tao Su
School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, Guangdong, People’s Republic of China
Email: stscdh@mail.sysu.edu.cn
Convolutional neural networks (CNNs) have been widely applied in the
field of computer vision with the development of artificial
intelligence. Depthwise separable convolutional neural networks such as
MobileNet and ShuffleNet offer significant advantages for deployment on
resource-constrained embedded devices because they have fewer parameters
and higher computational efficiency than earlier networks. In this
paper, we focus on a hardware implementation of ShuffleNetV2. We
optimize the network structure: the feature channel numbers, pooling
mode, and channel shuffle mode are modified, increasing accuracy by
1.09% while reducing the parameter count by 0.18M. Additionally, we
implement a highly parallel hardware accelerator on the Xilinx xczu9eg
FPGA that supports both standard convolution and depthwise convolution.
The accelerator consumes only 7.3 W while achieving an energy efficiency
of 13.45 GOPS/W, and it runs at 675.7 fps.
Introduction: Nowadays, with the development of artificial intelligence,
Convolutional Neural Networks (CNNs), as one of its representative
algorithms, have received increasing attention. Owing to their fast
speed and small model size, depthwise separable convolutional neural
networks have gained significant advantages in deployment on embedded
terminals.
However, depthwise separable convolution decouples a traditional
convolution into a depthwise convolution (DwC) and a pointwise
convolution (PwC) [1], so conventional CNN accelerators are no longer
suitable for performing the computations in depthwise separable
convolutional neural networks. In existing research, a pipelined
computing architecture introduced an additional feature bank to prefetch
data from off-chip memory [2]; however, the additional storage units
increase hardware resource usage and data read/write time. A
reconfigurable architecture combined the different computation modes of
the network model to support both PwC and DwC calculations [3], but this
structure cannot guarantee that all modules remain operational during
computation, so computing resources are utilized inefficiently and
wasted; it also does not support more complex operations such as grouped
convolution and channel shuffle. A traditional channel shuffle scheme
handles the shortcut branch separately and concatenates the results
after the PwC [4], but it still incurs a large number of memory read and
write operations, leading to high latency.
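For concreteness, the DwC/PwC decomposition of [1] can be sketched as a
NumPy reference model (an illustration only; the shapes and names below
are our own, and this is not the accelerator's dataflow):

  import numpy as np

  def depthwise_conv(x, w):
      # x: (C, H, W); w: (C, k, k) -- one k-by-k filter per channel,
      # with no summation across channels (the DwC stage).
      C, H, W = x.shape
      k = w.shape[-1]
      out = np.zeros((C, H - k + 1, W - k + 1))
      for c in range(C):
          for i in range(H - k + 1):
              for j in range(W - k + 1):
                  out[c, i, j] = np.sum(x[c, i:i+k, j:j+k] * w[c])
      return out

  def pointwise_conv(x, w):
      # x: (C_in, H, W); w: (C_out, C_in) -- a 1x1 convolution that
      # mixes channels at every pixel (the PwC stage).
      return np.tensordot(w, x, axes=([1], [0]))

  # A depthwise separable convolution is DwC followed by PwC:
  x = np.random.rand(8, 16, 16)
  y = pointwise_conv(depthwise_conv(x, np.random.rand(8, 3, 3)),
                     np.random.rand(16, 8))
  # y.shape == (16, 14, 14). A standard convolution would instead sum
  # over all input channels inside one (C_out, C_in, k, k) kernel,
  # which is why the two modes need different accelerator dataflows.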
Based on the above observations, the main contributions of this paper
are as follows. We redesign the structure of the depthwise separable
convolutional network ShuffleNetV2: the network channels, pooling mode,
and channel shuffle mode are optimized, yielding a 12.9% decrease in
parameter count and a 1.09% increase in accuracy. We also propose a
hardware accelerator that supports both DwC and PwC, allowing the two
convolution types to fully utilize and share the hardware resources of
the computing array. The accelerator achieves a high energy efficiency
ratio with minimal FPGA hardware resource utilization, reaching a frame
rate of 675.7 fps for image processing.
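For reference, the baseline ShuffleNet channel shuffle operation that
the optimized shuffle mode builds on can be written as the following
sketch (the standard software definition, not the hardware reordering
proposed in this paper):

  import numpy as np

  def channel_shuffle(x, groups):
      # x: (C, H, W); interleave channels across groups so information
      # flows between grouped convolutions in the next layer.
      C, H, W = x.shape
      assert C % groups == 0
      return (x.reshape(groups, C // groups, H, W)
               .transpose(1, 0, 2, 3)
               .reshape(C, H, W))

  # With C = 6 and groups = 2, the channel order becomes
  # [0, 3, 1, 4, 2, 5].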
Design details: The goal of this work is to implement depthwise
separable convolution computation on resource-constrained embedded
terminals, enabling fast and efficient image classification.
Firstly, the pooling layer after the first convolution layer is removed
to ensure that more effective feature information enters the building
block for feature extraction. Secondly, to ensure high utilization of
the processing elements (PEs) in the convolution computing array, the
number of channels after each down-sampling unit is changed from a
multiple of 29 to a multiple of 2⁷. This allows DwC and PwC to fully
utilize and share the resources of the same computing array without
wasting hardware resources. The simplified ShuffleNetV2 structure is
shown in Table 1.
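As a rough illustration of why this change helps (the 32-lane PE array
width below is an assumption made for this sketch, not a figure reported
in this paper), channel counts that are multiples of a power of two fill
every tile of the array, whereas the original widths leave lanes idle:

  import math

  def pe_utilization(channels, lanes=32):   # lanes: hypothetical array width
      tiles = math.ceil(channels / lanes)   # channel tiles mapped onto the array
      return channels / (tiles * lanes)     # fraction of lanes doing useful work

  for c in (116, 232, 464):                 # ShuffleNetV2 1x widths (multiples of 29)
      print(c, f"{pe_utilization(c):.1%}")  # 90.6%, 90.6%, 96.7%
  for c in (128, 256, 512):                 # multiples of 2**7
      print(c, f"{pe_utilization(c):.1%}")  # 100.0% each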
In addition, to further compress the network, we quantize the weights
from 32-bit floating-point numbers to 8-bit fixed-point numbers, with 3
bits for the integer part and 5 bits for the fractional part.
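A minimal sketch of this fixed-point scheme, assuming the sign bit is
counted among the 3 integer bits so that representable weights are
multiples of 2**-5 in [-4, 4):

  import numpy as np

  FRAC_BITS = 5                              # 5 fractional bits -> scale of 2**5

  def quantize_weights(w):
      q = np.round(w * (1 << FRAC_BITS))     # scale and round to the nearest step
      return np.clip(q, -128, 127).astype(np.int8)  # saturate to the 8-bit range

  def dequantize(q):
      return q.astype(np.float32) / (1 << FRAC_BITS)

  w = np.array([0.1, -1.37, 3.9], dtype=np.float32)
  print(dequantize(quantize_weights(w)))     # [ 0.09375 -1.375  3.90625]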