A Depthwise Separable Convolution Hardware Accelerator for ShuffleNetV2
Linshuang Li, Dihu Chen, Tao Su
School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, Guangdong, People's Republic of China
Email: stscdh@mail.sysu.edu.cn
Convolutional neural networks (CNNs) have been widely applied in computer vision with the development of artificial intelligence. Depthwise separable convolutional neural networks such as MobileNet and ShuffleNet offer significant advantages for deployment on resource-constrained embedded devices owing to their smaller parameter counts and higher computational efficiency compared with earlier networks. In this paper, we focus on the hardware implementation of ShuffleNetV2. We optimize the network structure by modifying the feature channel numbers, pooling mode, and channel shuffle mode, increasing accuracy by 1.09% while reducing the parameter count by 0.18M. In addition, we implement a highly parallel hardware accelerator on a Xilinx XCZU9EG FPGA that supports both standard convolution and depthwise convolution. The accelerator consumes only 7.3 W, achieves an energy efficiency of 13.45 GOPS/W, and runs at a frame rate of 675.7 fps.
Introduction: With the development of artificial intelligence, convolutional neural networks (CNNs), one of its representative algorithms, have received increasing attention. Owing to their fast inference and small model size, depthwise separable convolutional neural networks have gained significant advantages when deployed on embedded terminals.
Depthwise separable convolutions decouple a traditional convolution into a depthwise convolution (DwC) and a pointwise convolution (PwC) [1]. Conventional CNN accelerators are therefore no longer well suited to computing depthwise separable convolutional neural networks. Among existing designs, a pipelined computing architecture introduced an additional feature bank to prefetch data from off-chip memory [2], but the extra storage units increase hardware resource usage and data read/write time. A reconfigurable architecture considered the combination of different computation modes in the network model and supports both PwC and DwC calculations [3]; however, it cannot guarantee that all modules are active during computation, leading to inefficient and wasteful use of computing resources, and it does not support complex operations such as grouped convolution and channel shuffle. A traditional channel shuffle scheme handled the shortcut branch separately and concatenated the results after PwC [4], but it still incurred a large number of memory reads and writes, leading to high latency.
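As background for the DwC/PwC decoupling described above, the two stages can be sketched in NumPy as follows. This is an illustrative reference implementation only (shapes, stride 1, no padding, and function names are our assumptions, not part of the accelerator design); the comment notes the compute saving that motivates the decomposition.

```python
import numpy as np

def depthwise_conv(x, dw_w):
    # x: (C, H, W) input feature map; dw_w: (C, k, k), one k x k filter
    # per channel. Each channel is filtered independently (no channel mixing).
    C, H, W = x.shape
    _, k, _ = dw_w.shape
    out = np.zeros((C, H - k + 1, W - k + 1))
    for c in range(C):
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                out[c, i, j] = np.sum(x[c, i:i + k, j:j + k] * dw_w[c])
    return out

def pointwise_conv(x, pw_w):
    # x: (C, H, W); pw_w: (C_out, C). A 1x1 convolution that mixes channels:
    # every output channel is a weighted sum over all input channels.
    return np.tensordot(pw_w, x, axes=([1], [0]))

# A standard k x k convolution with C input and M output channels costs
# about C*M*k*k MACs per output pixel; DwC + PwC costs about C*k*k + C*M,
# which is the parameter and compute saving exploited by MobileNet/ShuffleNet.
```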
Based on the above observations, the main contributions of this paper are as follows. We redesign the structure of the depthwise separable network ShuffleNetV2: the channel numbers, pooling mode, and channel shuffle mode are optimized, yielding a 12.9% reduction in parameter count and a 1.09% accuracy increase. We also propose a hardware accelerator that supports both DwC and PwC, allowing the two convolution types to fully utilize and share the resources of a single computing array. The design achieves a high energy efficiency ratio with minimal FPGA hardware resource utilization and an image-processing frame rate of 675.7 fps.
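For reference, the standard ShuffleNet channel shuffle (the operation whose mode this work optimizes) can be written as a reshape-transpose-reshape over the channel axis. The sketch below shows only the baseline operation from the original ShuffleNet papers, not the modified shuffle mode proposed here, which is a hardware dataflow change.

```python
import numpy as np

def channel_shuffle(x, groups):
    # x: (C, H, W). Standard ShuffleNet shuffle: view the C channels as a
    # (groups, C // groups) grid, transpose it, and flatten back, so that
    # channels from different groups become interleaved.
    C, H, W = x.shape
    assert C % groups == 0
    return (x.reshape(groups, C // groups, H, W)
             .transpose(1, 0, 2, 3)
             .reshape(C, H, W))
```

With C = 4 and groups = 2, the channel order [0, 1, 2, 3] becomes [0, 2, 1, 3], interleaving the two groups.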
Design details: The aim of this work is to perform depthwise separable convolution on resource-constrained embedded terminals, enabling fast and efficient image classification. First, the pooling layer after the first convolution layer is removed so that more effective feature information enters the building blocks for feature extraction. Second, to maintain high utilization of the processing elements (PEs) in the convolution computing array, the number of channels after each down-sampling unit is changed from 29× to 27×. This allows DwC and PwC to fully utilize and share the same computational array without wasting hardware resources. The simplified ShuffleNetV2 structure is shown in Table 1. In addition, to further compress the network, we quantize the weights from 32-bit floating-point to 8-bit fixed-point numbers, with 3 bits for the integer part and 5 bits for the fractional part.
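The 8-bit fixed-point quantization described above (3 integer bits, 5 fractional bits) can be sketched as follows. We assume a signed two's-complement format in which the sign is counted within the integer bits, giving a representable range of [-4.0, 4.0 - 2^-5] with step 2^-5 = 0.03125; the paper does not spell out this sign convention, so it is an assumption.

```python
import numpy as np

FRAC_BITS = 5          # fractional bits: quantization step is 2^-5
SCALE = 2 ** FRAC_BITS

def quantize_q3_5(w):
    # Map float32 weights to signed 8-bit fixed point by scaling,
    # rounding to the nearest step, and saturating to the int8 range.
    return np.clip(np.round(w * SCALE), -128, 127).astype(np.int8)

def dequantize_q3_5(q):
    # Recover the approximate float value represented by the fixed-point code.
    return q.astype(np.float32) / SCALE
```

For example, 1.0 quantizes to the code 32 and round-trips exactly, while out-of-range values such as 10.0 saturate to 127 (about 3.97 after dequantization).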