

Future Work

In this project, we implemented a basic neural network structure and several basic layers. However, neural networks are a very broad topic, and many more features could be added with more time and better GPU resources. For example, we could add data augmentation such as cropping, rotation, and translation to pre-process the data, and we could add different optimizers and techniques such as dropout layers and skip connections to get better training results. Other types of layers, such as branch and concatenation layers, could also be implemented so that the network structure can handle more complex problems. In summary, there is still a lot worth exploring in parallelizing CNNs with CUDA.

Results Analysis

Speedup and parallelism analysis

Computation resource limits. Due to the limited computation resources of our GPU, we believe we could achieve better speedup with access to a more powerful GPU.

Not a parallelism issue, but a dependency issue. For each layer, whether in the forward pass, the backward pass, or the weight update, we use parallelism to make the computation much faster. However, given the essential characteristics of a neural network, we think dependency is the main issue: we can only parallelize the computation within each layer, while the layer sequence itself must be followed. From the experiments we conducted, we found that adding more neurons to the convolutional and fully connected layers did not necessarily add more computation time. Doing so only increases the workload of the parallel part, and as long as the number of parameters stays within a certain range, the CUDA threads can handle the parallelism because each spatially local region can be computed independently. However, when we added more layers to the network structure, especially convolutional layers, the training time per epoch increased greatly.

Synchronization overhead. We think the most difficult issue is the synchronization overhead between different layers. For each layer, the previous layer's output must be fully computed before this layer's forward pass can run, and the gradient of the following layer must be available before its backward pass can run. The same issue also exists within a layer when the computation involves dependencies between elements, as in the softmax layer.

Data transfer. All the training and testing data are first read from files into CPU memory, and we need to transfer the data onto the GPU in order to call the global functions that run the computation on the device. Besides, the initialized weights, biases, and gradients need to be transferred onto the GPU as well during layer construction. In general, we malloc space for host data on the CPU and device data on the GPU, and create pointers to the corresponding host and device locations. The challenging parts are avoiding memory leaks and keeping only the necessary buffers so as not to occupy too much GPU memory.
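As a minimal sketch of this host/device buffer handling, the snippet below allocates a weight buffer on the host, mirrors it on the device, and frees both. The names (h_weights, d_weights, weightCount) and the placeholder initialization are illustrative assumptions, not the project's actual code.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

static void checkCuda(cudaError_t err, const char* msg) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main() {
    const int weightCount = 5 * 5 * 3 * 32;            // e.g. 32 filters of size 5x5x3 (assumed)
    float* h_weights = (float*)malloc(weightCount * sizeof(float));
    for (int i = 0; i < weightCount; ++i) h_weights[i] = 0.01f;   // placeholder initialization

    float* d_weights = nullptr;
    checkCuda(cudaMalloc(&d_weights, weightCount * sizeof(float)), "cudaMalloc weights");
    checkCuda(cudaMemcpy(d_weights, h_weights, weightCount * sizeof(float),
                         cudaMemcpyHostToDevice), "copy weights host to device");

    // ... launch kernels that read d_weights ...

    checkCuda(cudaFree(d_weights), "cudaFree weights");  // release device memory to avoid leaks
    free(h_weights);
    return 0;
}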

Network training speedup. Results from experiments on both changing the number of layers and changing the number of trainable parameters while keeping the form of the layers are shown in the tables above. The speedup from increasing trainable parameters while keeping the form of the layers unchanged is more obvious than the speedup from increasing the number of layers. This is mainly because it is hard to parallelize the sequential network structure, but it is possible to parallelize the data computation within a layer.

Network testing accuracy. As stated above, training achieves around 75% testing accuracy, which is reasonable for this simple network architecture. A similar network structure provided on the Caffe official website for the CIFAR-10 classification problem also reaches around 75% testing accuracy.

Architecture

Language API: We chose CUDA as our method to solve this problem, mainly because of the accessibility of GPU resources. For our implementation, we tested our code on a laptop with a GTX 960M GPU and 4 GB of memory.


Context: We ported the sequential code to run on the GPU and compared the performance of the sequential and parallel implementations.


Code structure: The project consists of three folders plus the main execution scripts. The folders are: 1. header, which contains the header files for all the C++ interface files and the host helper functions running on the CPU; 2. src, which contains the cpp files corresponding to the header folder; 3. layers, which contains the header and CUDA files for all the layer implementations and the network structure.


Convolutional layer: Takes a feature map as input, convolves it with a number of kernels, and outputs a new, resized feature map.


Maxpooling layer: Takes a feature map as input (usually the output of a convolutional layer) and outputs a downsampled feature map. The implementation handles both the case where the pooling size equals the stride and the case where it does not.


ReLU layer: Takes a feature map as input and outputs the activated feature map with the size unchanged. This applies a non-linear operation to the feature map and also helps avoid vanishing gradients in the backward pass.


Fully connected layer: Takes a feature map as input, flattens it into a vector, and outputs a vector whose neurons fully connect all input elements to all output elements.


Softmax layer: Takes a vector as input and outputs the normalized probability of the input belonging to each class.

Implementation Details

Convolutional layer
For convolutional layers, we assign each output element to a single thread in the forward pass, and each input element to a single thread in the backward pass. The value of each output unit is independent and is produced by a sequence of operations: convolution (multiply and add) with the weight kernel over a local region, followed by a vectorized addition of the bias to each output element. For the backward pass, since the weight gradient can be calculated as the multiplication between the input and the gradient of the latter layer, we set the number of tasks to the number of input elements and run them in parallel.
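A hedged sketch of this one-thread-per-output-element forward kernel is shown below. The data layout (row-major, input C_in x H x W, weights C_out x C_in x K x K, stride 1, no padding) and all the parameter names are assumptions for illustration, not the project's actual interface.

__global__ void convForward(const float* input, const float* weights, const float* bias,
                            float* output,
                            int C_in, int H, int W,
                            int C_out, int K, int outH, int outW) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total = C_out * outH * outW;
    if (idx >= total) return;               // one thread per output element

    int oc = idx / (outH * outW);           // output channel
    int oy = (idx / outW) % outH;           // output row
    int ox = idx % outW;                    // output column

    float sum = bias[oc];
    for (int ic = 0; ic < C_in; ++ic)       // accumulate over the local input region
        for (int ky = 0; ky < K; ++ky)
            for (int kx = 0; kx < K; ++kx) {
                int iy = oy + ky, ix = ox + kx;
                sum += input[(ic * H + iy) * W + ix] *
                       weights[((oc * C_in + ic) * K + ky) * K + kx];
            }
    output[idx] = sum;                      // each output element is computed independently
}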


Relu layer
For the ReLU layer, the output and input dimensions are the same, and the activation operation is independent for each element. So the number of threads in both the forward and backward passes is simply outputRows * outputCols * outputChannels. The backward pass of the ReLU layer just multiplies the gradient of the latter layer by the ReLU derivative (1 where the input is positive, 0 otherwise), so each element can be processed in parallel.
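A minimal sketch of these element-wise kernels, one thread per element (pointer names and sizes are illustrative):

__global__ void reluForward(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] > 0.0f ? in[i] : 0.0f;          // clamp negatives to zero
}

// Backward: pass the upstream gradient through only where the input was positive.
__global__ void reluBackward(const float* in, const float* gradOut, float* gradIn, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) gradIn[i] = in[i] > 0.0f ? gradOut[i] : 0.0f;
}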


Fully connected layer
For the fully connected layer, the input dimension is the same as the previous layer's output, and the output dimension is the number of neurons in the hidden layer. Since the fully connected layer is usually placed after the convolutional layers, we need to flatten the input feature map into a vector; we then use the cuBLAS library to compute the matrix multiplication.
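The sketch below shows one way to compute a single-sample forward pass with cuBLAS, assuming the weights d_W are stored row-major as outDim x inDim, d_x is the flattened input, and d_b is the bias; these names and the layout are assumptions, not the project's actual code. Because cuBLAS is column-major, the row-major weight buffer is read as its transpose and the transpose op is requested.

#include <cublas_v2.h>
#include <cuda_runtime.h>

void fcForward(cublasHandle_t handle,
               const float* d_W, const float* d_x, const float* d_b,
               float* d_y, int inDim, int outDim) {
    const float alpha = 1.0f, beta = 1.0f;
    // Start from the bias, then accumulate W * x on top of it (beta = 1).
    cudaMemcpy(d_y, d_b, outDim * sizeof(float), cudaMemcpyDeviceToDevice);
    cublasSgemv(handle, CUBLAS_OP_T,
                inDim, outDim,           // dimensions of the column-major view of d_W
                &alpha, d_W, inDim,
                d_x, 1,
                &beta, d_y, 1);          // d_y = W * x + b
}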


Max pooling layer
For the pooling layer, the number of threads corresponds to outputRows * outputCols. The max pooling layer operates independently on every depth slice of the input and resizes it spatially using the MAX operation; the depth dimension remains unchanged. For both the forward and backward passes the thread assignment is the same, with each thread handling a small local region of dimension poolingSize * poolingSize * outputChannels. Each local region is computed in parallel. The backward pass of max pooling initializes a zero map of the input size and assigns the gradient of the latter layer to the location of the maximum element in each spatially local region.
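A sketch of the forward kernel under these assumptions is given below: one thread per output (row, col) location, looping over channels inside the thread and recording the argmax index so the backward pass can route the gradient to the winning input element. The flat index layout and all names are illustrative.

__global__ void maxPoolForward(const float* input, float* output, int* argmax,
                               int channels, int inH, int inW,
                               int outH, int outW, int poolSize, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= outH * outW) return;          // one thread per output spatial location
    int oy = idx / outW, ox = idx % outW;

    for (int c = 0; c < channels; ++c) {     // depth dimension handled inside the thread
        float best = -1e30f;
        int bestIdx = -1;
        for (int py = 0; py < poolSize; ++py)
            for (int px = 0; px < poolSize; ++px) {
                int iy = oy * stride + py, ix = ox * stride + px;
                if (iy < inH && ix < inW) {  // handles poolSize != stride at the borders
                    int inIdx = (c * inH + iy) * inW + ix;
                    if (input[inIdx] > best) { best = input[inIdx]; bestIdx = inIdx; }
                }
            }
        int outIdx = (c * outH + oy) * outW + ox;
        output[outIdx] = best;
        argmax[outIdx] = bestIdx;            // remembered for the backward pass
    }
}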


Softmax layer
The softmax layer is often combined with a cross-entropy loss. To compute the softmax output, we first need to sum the exponentials of all the input elements. This sum is stored in shared memory, since every element must be divided by it to obtain the final output. The parallelization is therefore divided into two phases: first a half-by-half reduction that adds up all the exponential values, and then a fully parallel pass over each element to obtain the final result.
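A sketch of this two-phase scheme is shown below, assuming the class scores fit in a single block (e.g. 10 classes for CIFAR-10) and the block size is a power of two; names and the launch configuration are illustrative. A production version would also subtract the maximum logit before exponentiating for numerical stability.

__global__ void softmaxForward(const float* logits, float* probs, int n) {
    extern __shared__ float partial[];
    int tid = threadIdx.x;

    float e = (tid < n) ? expf(logits[tid]) : 0.0f;
    partial[tid] = e;
    __syncthreads();

    // Phase 1: half-by-half tree reduction to obtain the sum of exponentials.
    for (int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if (tid < offset) partial[tid] += partial[tid + offset];
        __syncthreads();
    }

    // Phase 2: every element divides its exponential by the shared sum in parallel.
    if (tid < n) probs[tid] = e / partial[0];
}

// Example launch for 10 classes (block size must be a power of two >= n):
// softmaxForward<<<1, 16, 16 * sizeof(float)>>>(d_logits, d_probs, 10);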


Dependencies
In the forward pass, each layer depends on the previous layer, and in the backward pass, each layer depends on the following layer. This is the essential dependency of the basic network structure. In addition, dependencies also exist within some layers; for example, each output element of the softmax layer depends on the completed sum of the exponentials of all the inputs. Details for each layer are explained in the sections above.
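A small host-side sketch of this layer-to-layer dependency: forward passes run in input order and backward passes in reverse order, with parallelism available only inside each layer's kernels. The Layer interface and function names here are hypothetical, not the project's actual API.

#include <vector>

struct Layer {
    virtual float* forward(float* d_in) = 0;        // launches this layer's forward kernels
    virtual float* backward(float* d_gradOut) = 0;  // launches this layer's backward kernels
    virtual ~Layer() {}
};

void trainStep(std::vector<Layer*>& net, float* d_input, float* d_lossGrad) {
    float* x = d_input;
    for (size_t i = 0; i < net.size(); ++i)         // forward: each layer waits on the previous one
        x = net[i]->forward(x);

    float* g = d_lossGrad;
    for (size_t i = net.size(); i-- > 0; )          // backward: each layer waits on the next one
        g = net[i]->backward(g);
}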

Summary

Convolutional Neural Networks (CNNs) are widely used in many research areas, and their implementation involves high arithmetic intensity. The streaming model of the Graphics Processing Unit (GPU), which devotes more transistors to data processing and has many more cores and blocks than a CPU, significantly speeds up the computation compared with the CPU, providing good precision and accuracy while maintaining wide availability to end users and strong scalability. For our project, we propose to bridge the popular concept of deep learning with computer vision. In terms of application, we are most interested in applying parallel programming methods to image classification, more specifically building a CNN on CUDA to classify images.
