15618 Project - Parallel Programming for Image Classification based on Neural Network
Project Proposal
Parallel programming for image classification based on Neural Network
We are going to parallelize the training and testing procedures of image classification using a simple neural network on the CIFAR-10 dataset. We will implement a CUDA version instead of training on the CPU.
BACKGROUND
Training a neural network on a CPU with a large dataset takes a long time, because the sequential program must compute each neuron's output one at a time. As the size of the network increases, the time for both training and testing grows dramatically. We plan to parallelize the computation within each layer of the neural network and reduce the time for both training and testing.
THE CHALLENGES
Image classification is conducted by applying several consecutive layers to the input images. The computational cost of applying filter kernels to each image sub-region is very high. Deciding how to partition the data and how to handle synchronization between the forward and backward passes while achieving a high degree of parallelism is challenging. Besides, corner cases on the image boundaries must also be handled correctly when parallelizing.
The dependencies mainly occur between different layers, for both the forward and backward passes. As for memory access characteristics, there is locality in the convolutional and max pooling layers. The communication-to-computation ratio depends on how the data is partitioned and on the size of a single piece of data. When updating the weights and biases, communication is unavoidable.
Workload imbalance is also a concern: each thread is assigned to a different area of the feature map and uses a different part of the data, so some threads may fetch harder tasks that take longer to process, while other threads finish their easier tasks sooner. So there is definitely workload imbalance involved in the process. Besides the above difficulties, we also need to compute the correct values for inputs that lie on the boundary, which makes the problem harder to handle.
RESOURCES
We plan to use the CMU GPU cluster to finish this project. We plan to implement CUDA versions of the layers required by our network from scratch (convolutional, fully connected, and max pooling layers). Direct access to the CMU GPU cluster is all we need.
GOALS AND DELIVERABLES
Must achieve: implement the convolution, fully connected, dropout, and softmax forward passes in parallel.
Plan to achieve: implement the convolution, fully connected, and dropout layers, both forward and backward, in parallel.
Hope to achieve: implement different kinds of optimizers.
We are going to demonstrate the speedup over the sequential version, for both the training procedure and the testing procedure. Our system is expected to achieve more than a 4x speedup compared to the sequential program.
PLATFORM CHOICE
The GPUs in the CMU GPU cluster have many cores, so we would like to partition the images into sub-regions and update the corresponding parameters, such as the weights and biases, in parallel.
SCHEDULE
11.06 - 11.13 Implement the sequential convolutional, max pooling, and fully connected layers
11.13 - 11.20 Implement the sequential dropout layer
11.20 - 11.27 Implement the convolutional, max pooling, and dropout layers in CUDA
11.27 - 12.02 Implement the training and testing scripts
12.03 - 12.10 Test and improve the final results


