Harbin Institute of Technology (HIT) Artificial Intelligence Summer School Assignment


Final assignment for the High-Performance Machine Learning course at the HIT Artificial Intelligence Summer School, taken in the summer after my sophomore year.

Report on Paper “Adaptive Deep Reuse: Accelerating CNN Training on the Fly”

Link to the referenced paper

1.Summary of the paper

(1)Main problems & motivation

This paper proposes a method called "Adaptive Deep Reuse" that avoids unnecessary computation during CNN training. As the experimental results show, in real CNN training there are many similarities among the input images and among the neuron vectors derived from them. Repeating the same computation wastes a great deal of time and resources; if we can reuse identical or near-identical results, we save a large amount of redundant computation and shorten CNN training. The central issue is how to find similar items efficiently, but several other issues must be addressed as well. CNN training consists of two kinds of propagation, forward and backward, which are handled differently during training, and, as most people care about, the accuracy of the result must also be preserved. In addition, the new method must work under most of the conditions that arise in real training. So the main questions this paper tries to solve are: how to find identical or similar neuron vectors; whether adaptive deep reuse is useful in the backward propagation; how to keep the error small while saving time with this new training method; and how to make the method applicable to different convolutional neural networks.

(2)Method the authors used

i. How to find identical or similar neuron vectors

As noted above, there are many similarities among the images and the neuron vectors. The most direct approach is clustering, but the paper does not use the most common clustering method, k-means; it uses another method, LSH (Locality-Sensitive Hashing). A parameter called the "remaining ratio" is defined as the number of remaining groups divided by the original number of neuron vectors; it describes the clustering level. A hash function built from a random vector v maps an input x to one bit: if v·x > 0 the result is 1, otherwise it is 0. Using a set of several random vectors and computing v·x one by one, we obtain a bit vector called the "ID", which acts like an identity number for the neuron vector. Two vectors with the same ID are placed in the same cluster, so this procedure finds the similar neuron vectors. The scope of clustering can of course differ; three scopes are used to suit different situations: single-input level, single-batch level, and across-batch level. A minimal sketch of this hashing-and-clustering step is given below.
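The following is a minimal NumPy sketch of the idea described in part i, not the authors' implementation; the function names (lsh_ids, cluster_rows, reuse_matmul) and the toy sizes are my own assumptions. It hashes each length-L sub-vector to an H-bit ID with random sign projections, groups rows that share an ID, and computes the matrix product only once per cluster centroid.

```python
import numpy as np

def lsh_ids(rows, H, rng):
    """Map each row to an H-bit ID: bit j is 1 iff v_j . row > 0 for a random v_j."""
    V = rng.standard_normal((rows.shape[1], H))    # H random projection vectors
    return (rows @ V > 0).astype(np.uint8)

def cluster_rows(ids):
    """Group row indices whose bit IDs are identical (the same LSH bucket)."""
    buckets = {}
    for idx, key in enumerate(map(tuple, ids)):
        buckets.setdefault(key, []).append(idx)
    return list(buckets.values())

def reuse_matmul(x, W, L, H, rng):
    """Approximate x @ W by computing one product per cluster of similar sub-vectors.
    x: (N, K) neuron vectors, W: (K, M) weights; L is assumed to divide K."""
    N, K = x.shape
    y = np.zeros((N, W.shape[1]))
    n_clusters = n_subvectors = 0
    for s in range(0, K, L):                       # process each length-L segment
        seg = x[:, s:s + L]
        clusters = cluster_rows(lsh_ids(seg, H, rng))
        n_clusters += len(clusters)
        n_subvectors += N
        for members in clusters:
            centroid = seg[members].mean(axis=0)   # representative sub-vector
            y[members] += centroid @ W[s:s + L]    # one product shared by the whole cluster
    remaining_ratio = n_clusters / n_subvectors    # describes the clustering level
    return y, remaining_ratio

# Toy usage: 256 deliberately redundant neuron vectors of length 64, with L = 16, H = 6.
rng = np.random.default_rng(0)
x = np.repeat(rng.standard_normal((32, 64)), 8, axis=0)
W = rng.standard_normal((64, 10))
y_approx, rc = reuse_matmul(x, W, L=16, H=6, rng=rng)
print(rc, np.abs(y_approx - x @ W).max())
```

With highly redundant inputs the remaining ratio stays far below 1, which is exactly the condition under which reuse pays off.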
ii. Whether adaptive deep reuse is useful in the backward propagation

As we know, the backward propagation takes about 2/3 of the whole training time, so it is important to show that the method developed in part i also applies to the backward propagation; if it does, we can use it in both passes and save a huge amount of computation. The key tool is a simple rule we have learned before, the chain rule, which, put directly, is the way to differentiate a composite function. During the forward propagation we compute y = x·W + b; during the backward propagation we must produce two results: the weight gradient (the partial derivative of the loss with respect to W) and the partial derivative of the loss with respect to the input x. By the chain rule, ∂L/∂W = xᵀ(∂L/∂y) and ∂L/∂x = (∂L/∂y)Wᵀ; both are matrix products of the same form as the forward pass, so the clusters found there can be reused, which shows that the new method is also useful in the backward propagation.

iii. How to reduce the error and save time in this new training method

There is a common saying: we cannot have our cake and eat it too. Reducing the error and saving time cannot both be pushed to their best at the same time. As the paper shows, although the time cost of the new method can reach a very low level, the accuracy drops as well, so a decision must be made between cutting more computation and keeping higher accuracy. With better hardware the computation overhead matters less, and we can come closer to both goals; in practice, the trade-off between accuracy and computation overhead is made according to the hardware we train on.

iv. How to make this method useful for different CNNs

The paper proposes two strategies to make the method suit different CNN models. In the first strategy, at the beginning of training the sub-vector length L is made as large as possible and the number of hash functions H as small as possible while keeping the error acceptable; toward the end, L is made as small and H as large as we can. In this way many unnecessary computations are saved at the beginning, and a much more accurate result is obtained at the end. Rules must be set so that L and H remain suitable during training, and when to change them is also an important consideration: the parameters L and H are changed when the loss stops decreasing, i.e., reaches its lowest point under the current setting, and the paper adds some constraints to guarantee a shorter training time and a more accurate result. The second strategy is simpler than the first. A parameter CR expresses whether reuse is applied: CR = 1 means reuse is used and CR = 0 means it is not. During training, L and H never change; suitable values of L and H are chosen and CR is set to 1 at first, and once the loss reaches its lowest point, CR is set to 0 and training continues without reuse. A hypothetical sketch of such a schedule is given below.
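Below is a hypothetical sketch of how such a parameter schedule could be wired into a training loop; the names ReuseConfig, loss_plateaued, train_one_epoch, and the plateau heuristic are my own assumptions, not the paper's, and only the control flow of the two strategies above is being illustrated.

```python
from dataclasses import dataclass

@dataclass
class ReuseConfig:
    L: int        # sub-vector length (larger -> coarser clustering, less computation)
    H: int        # number of hash functions (smaller -> coarser clustering, less computation)
    CR: int = 1   # 1 = apply deep reuse, 0 = fall back to exact computation

def loss_plateaued(losses, window=3, tol=1e-3):
    """Heuristic stand-in for 'the loss has stopped decreasing'."""
    return len(losses) > window and losses[-window - 1] - losses[-1] < tol

def strategy_one(train_one_epoch, epochs, coarse_to_fine):
    """Strategy 1: move through (L, H) settings from coarse to fine,
    advancing whenever the loss stops decreasing under the current setting."""
    history, recent, step = [], [], 0
    cfg = coarse_to_fine[step]
    for _ in range(epochs):
        loss = train_one_epoch(cfg)                # assumed to return the epoch loss
        history.append(loss); recent.append(loss)
        if loss_plateaued(recent) and step + 1 < len(coarse_to_fine):
            step += 1
            cfg = coarse_to_fine[step]             # smaller L, larger H -> more accurate
            recent = []                            # restart plateau detection
    return history

def strategy_two(train_one_epoch, epochs, L, H):
    """Strategy 2: fixed L and H with reuse on, then switch reuse off at the plateau."""
    cfg, history = ReuseConfig(L=L, H=H, CR=1), []
    for _ in range(epochs):
        history.append(train_one_epoch(cfg))
        if cfg.CR == 1 and loss_plateaued(history):
            cfg.CR = 0                             # finish training with exact computation
    return history
```

For strategy one, the coarse_to_fine list might look like [ReuseConfig(L=32, H=4), ReuseConfig(L=16, H=8), ReuseConfig(L=8, H=16)], moving from cheap, coarse clustering toward accurate, fine clustering as training converges.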

(3)The experimental settings

The authors choose three different networks, CifarNet, AlexNet, and VGG-19, which differ in their number of layers and in the datasets used. First, the similarity of the neuron vectors is measured; the results show that all three networks contain a large number of similar vectors, so the deep reuse method can indeed be used to simplify training. They also find that the finer the clustering granularity, the more similarity can be found among the vectors, but the higher the clustering overhead. To study each parameter's influence on the result, a control-variable approach is used. The paper finds that with the remaining ratio (rc) and H fixed, a smaller L gives higher accuracy; with L fixed, a larger H gives higher accuracy and a larger rc, but also more computation; and CR = 1 (reusing the neuron vectors) costs less computation than CR = 0, at a small loss of accuracy. The adaptation strategies described above may behave differently, so three are tested: the first keeps L and H fixed, the second changes L and H to adapt to the training process, and the third is the second strategy in (2) iv. The result is that the second strategy saves more time than the third, and the first saves the least.

(4)A brief summary of the findings of the work

In this paper the authors propose a new method called adaptive deep reuse. It saves a significant amount of repeated computation by finding similar neuron vectors: LSH is used to identify them, and the corresponding computation results are reused. The authors also prove that the method can be applied to the backward propagation, then optimize the strategy for choosing its parameters, and finally design several experiments showing that the method significantly speeds up CNN training.

2.Advantages of the method/system compared to other alternative methods

This paper proposes a new way to optimize the computation of a CNN. Past research mostly focused on weight redundancy and tried to reduce the computation inside the convolutional layers, while this paper reduces the computation at a much earlier point: the input to each layer (the input image or the activation map). In this way a huge amount of computation is removed very early. Some research uses rounding to reduce computation [1]; although that approach does not reduce the accuracy of the final classification, it introduces errors during training that are unpredictable and hard to control, so it is unsuitable when a more accurate result is required. This paper reuses similar items in the input and can control the error through the parameters L, H, and CR, which gives a fairly controllable way to bound the error according to our needs. Another work also uses hashing [2], applying it inside the layers; that is effective too, but compared with this paper it does not remove the repetitive computation at the very beginning, so some time is still wasted. A further line of work removes superfluous weights; this saves time because there are plenty of unnecessary weights in a neural network, and removing them naturally raises the training speed, but it has a shortcoming: the backward propagation (BP) becomes very complex and extra strategies have to be developed to handle it [3]. This paper shows that adaptive deep reuse works for BP directly, so no extra effort is spent on such strategies. Many studies exploit the sparsity of the layers to remove unnecessary computation, but they require very sparse activation maps; when the activations become dense these methods can no longer simplify training, so the method in this paper is more general in that respect.

3.The limitations of the method/system

There are four main families of methods for this problem [4]: network pruning and thinning, tensor decomposition, knowledge transfer, and fine module design. Convolutional and fully connected layers contain many redundant parameters, and adaptive deep reuse cannot remove them; it only simplifies similar computations at a very early stage and does not offer a pruning solution. The method also forces a trade-off between accuracy and computation overhead: the more accurate the result, the more computational expense must be paid, and computation and error cannot both be reduced at the same time; this limitation comes from the neural network itself. In addition, the paper only reduces the computation of each iteration and pays no attention to reducing the number of iterations; reuse can indeed only cut the computation within a single iteration, which is a limitation of the method itself. Finally, the method is effective only if the input contains many similarities; if the input becomes complex, in other words has few similar parts, adaptive deep reuse loses its advantage over other methods.

4.Possible improvements

Adaptive deep reuse saves a lot of computational expense, but it still has limitations. The first is that it cannot remove redundant parameters; we could add extra strategies for the convolutional and fully connected layers to simplify training further. The authors mention another technique, enforcing a low-rank structure on the layers, which is different from the method in this paper; perhaps the two could be combined into something better. The conflict between accuracy and computation overhead also limits training; we can choose the parameters more rationally so that both reach an acceptable level, but the conflict itself remains and can only be made smaller. The paper does not try to reduce the number of iterations; future work could add other techniques such as large-batch data parallelism or importance sampling. In most conditions the inputs contain a large amount of similarity, so the input limitation is not a big worry, but when we do face such a situation we can preprocess the images with other methods to extract the main information first.

REFERENCES

[1] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in International Conference on Machine Learning, 2015, pp. 1737–1746.
[2] R. Spring and A. Shrivastava, "Scalable and sustainable deep learning via randomized hashing," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017, pp. 445–454.
[3] S. V. Kamarthi and S. Pittner, "Accelerating neural network training using weight extrapolations," Neural Networks, vol. 12, no. 9, pp. 1285–1299, 1999.
[4] LIN Jing-Dong, WU Xin-Yi, CHAI Yi, and YIN Hong-Peng, "Structure optimization of convolutional neural networks: a survey," Acta Automatica Sinica, 2019, XX(X): X–X.
