Lottery Ticket Hypothesis
The Lottery Ticket Hypothesis is a fascinating piece of research that won a Best Paper Award at ICLR 2019. In this post, we delve into the LTH and related research advancing the pruning literature.
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks (Frankle and Carbin, 2018)
The original paper conjectures that a randomly initialized network contains smaller sub-networks (~10% of the original size) that are as trainable as the full network, a surprising find given that training small networks from scratch usually converges to poor local minima. Finding the lottery ticket (the sub-network that trains well) is well motivated: it would give 10x compression without having to iteratively prune the model (iterative pruning requires roughly 10x more training). The authors show that unstructured pruning, either one-shot or iterative, where each of \(n\) rounds prunes \(p^{\frac{1}{n}}\%\) of the surviving parameters and the remaining weights are reset to their initialization values after each round, allows the pruned network to reach 10-20% of the original size. This suggests that pruning early to identify winning initializations and architectures can help reduce the overall training time required to produce high-performing pruned networks.
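A minimal PyTorch sketch of this iterative magnitude-pruning-with-rewinding loop may help make the procedure concrete. The `train_fn` callback is a placeholder, assumed to train the model to completion while keeping masked weights at zero:

```python
import copy
import torch

def iterative_prune_with_rewind(model, train_fn, prune_frac=0.2, rounds=5):
    # Save the original initialization theta_0 for rewinding.
    init_state = copy.deepcopy(model.state_dict())
    masks = {name: torch.ones_like(p) for name, p in model.named_parameters()
             if name.endswith("weight")}

    for _ in range(rounds):
        train_fn(model)  # assumed to train fully, keeping masked weights at 0
        for name, p in model.named_parameters():
            if name not in masks:
                continue
            # Prune the smallest-magnitude fraction of the surviving weights.
            alive = p.data[masks[name].bool()].abs()
            k = int(prune_frac * alive.numel())
            if k == 0:
                continue
            threshold = alive.sort().values[k - 1]
            masks[name] = masks[name] * (p.data.abs() > threshold).float()
        # Rewind: reset surviving weights to their initial values, zero the rest.
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in masks:
                    p.copy_(init_state[name] * masks[name])
    return masks
```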
Deconstructing lottery tickets: Zeros, signs, and the supermask (Zhou et al., 2019)
This paper probes further into the design variables of the LT hypothesis, namely the mask criterion, the mask-1 actions, and the mask-0 actions. Concretely, the LT hypothesis uses the post-training magnitude of weights as the mask criterion, resets the retained weights to their initialization values as the mask-1 action, and zeroes the pruned weights as the mask-0 action. The study shows that using the increase in weight magnitude over training as the mask criterion is equally competitive, while other variants (large or small initial values, small final values) are less so, when tested with fully connected layers on MNIST and CNNs of 2/4/6 layers on CIFAR-10. The key finding for mask-1 actions is that resetting the remaining weights to their initial values, as in the LT hypothesis, is effective because the signs of those values are retained; other re-initialization schemes perform significantly worse when the signs are flipped or randomized, but degrade only mildly when the signs are maintained.
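For concreteness, here is a minimal sketch of how two of these mask criteria might be computed; the tensor names `w_init` and `w_final` (weights at initialization and after training) are illustrative assumptions:

```python
import torch

def make_mask(w_init, w_final, frac_keep, criterion="large_final"):
    # Score each weight, then keep the top `frac_keep` fraction by score.
    if criterion == "large_final":            # the LT-hypothesis criterion
        score = w_final.abs()
    elif criterion == "magnitude_increase":   # the equally competitive variant
        score = w_final.abs() - w_init.abs()
    else:
        raise ValueError(criterion)
    k = max(1, int(frac_keep * score.numel()))
    threshold = score.flatten().sort(descending=True).values[k - 1]
    return (score >= threshold).float()
```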
The paper also shows that the standard mask-0 action of setting pruned weights to \(0\) is essential and acts as a form of training: freezing pruned weights at their initial values instead degrades performance increasingly as the CNN gets deeper. Moreover, a novel mask-0 action, which freezes a pruned weight at its initial value if training moved it away from zero and zeroes it otherwise, proves more powerful than the vanilla zeroing action in standard pruning schemes. Combining the best forms of these three design variables yields good performance: a randomly initialized, untrained network with a well-chosen supermask applied reaches up to \(86\%\) accuracy on MNIST, far out-performing the random-chance baseline of \(10\%\).
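A rough sketch of this hybrid mask-0 action, under the same `w_init`/`w_final` convention as above, and with kept weights rewound to their initial values per the LT mask-1 action:

```python
import torch

def freeze_or_zero(w_init, w_final, mask):
    # For pruned weights: freeze at init if training moved them away from
    # zero, otherwise set them to zero.
    moved_away = w_final.abs() > w_init.abs()
    pruned_vals = torch.where(moved_away, w_init, torch.zeros_like(w_init))
    # Kept weights are rewound to their initial values (the LT mask-1 action).
    return torch.where(mask.bool(), w_init, pruned_vals)
```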
Drawing early-bird tickets: Towards more efficient training of deep networks (You et al., 2019)
This paper builds on the lottery ticket hypothesis by showing that winning tickets can be found very early in training using a mask-distance metric, realizing energy savings of up to \(4.7\times\). The proposed training scheme, termed Early Bird (EB), avoids having to train the entire network to completion before pruning to obtain the winning tickets. Each pruned sub-network can be viewed as a binary mask applied element-wise to the weights of the original network, where pruning removes the \(p\%\) of channels with the smallest scaling factor \(\gamma\) in the batch normalization layers (a proxy for each channel's significance within its layer). For two sub-networks, the mask distance is the Hamming distance between their binary masks. The network is trained and iteratively pruned until the maximum Hamming distance among the masks of the past 5 consecutive sub-networks falls below a given threshold, indicating that the EB ticket has stabilized and will hardly change with further training. The EB scheme is also shown to be robust to low-precision training during the search for EB tickets.
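A minimal sketch of the two ingredients of this search, the BN-based channel mask and the mask-distance stopping rule; the function names and the `eps` threshold are illustrative assumptions, not the paper's exact implementation:

```python
from collections import deque
import torch

def bn_channel_mask(gammas, prune_frac):
    # Prune the `prune_frac` fraction of channels with the smallest |gamma|;
    # `gammas` is a flat tensor of all BN scaling factors in the network.
    k = max(1, int(prune_frac * gammas.numel()))
    threshold = gammas.abs().sort().values[k - 1]
    return gammas.abs() > threshold  # boolean mask, True = channel kept

def eb_ticket_found(history, new_mask, eps=0.1):
    # history = deque(maxlen=5); stop once the largest pairwise normalized
    # Hamming distance among the last 5 masks falls below eps.
    history.append(new_mask)
    if len(history) < history.maxlen:
        return False
    masks = list(history)
    dists = [(a ^ b).float().mean().item()
             for i, a in enumerate(masks) for b in masks[i + 1:]]
    return max(dists) < eps
```

The caller would keep `history = deque(maxlen=5)` and, after each pruning round, stop training the full network as soon as `eb_ticket_found(...)` returns True.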
Reconciling modern machine learning practice and the bias-variance trade-off (Belkin et al., 2018)
The traditional bias-variance trade-off in ML theory leads us to believe that as we increase the capacity of a model, the validation risk (loss) monotonically decreases and then increases as the training loss goes to 0, the decrease and increase corresponding to under-fitting and over-fitting respectively. The double descent risk curve empirically shows that deep learning models operate in a regime termed the 'interpolating regime': past the point where the training loss reaches 0, the validation loss decreases again to a minimum equal to or lower than that at the cusp of under- and over-fitting. The regime is named for the understanding that over-parameterization (# parameters > # data points) allows the model to interpolate between data points and arrive at a hypothesis that better satisfies Occam's razor (in having a shorter minimum description length).
The connection with the Lottery Ticket Hypothesis is as follows: are lottery tickets sub-networks that operate in the traditional machine learning regime, yet are 'equivalent' to their over-parameterized counterparts? If so, is there something about the change in loss that tells us which regime we are in when instantiating a neural network, letting us determine both a) the 'goodness' of the current initialization, without access to the ground-truth lottery ticket, and b) a gradient-based means of approaching the lottery ticket?
Thoughts
In my opinion, lottery tickets are nothing but good initializations of neural networks that, with SGD, can easily reach the global minima. Recreating high learning rates at the beginning of training when resetting to lottery-ticket values is important, as high learning rates are likely crucial for the model to avoid poor valleys in the loss landscape. To avoid iterative pruning as the search algorithm for the pruning mask, could we use other methods (e.g. genetic algorithms, simulated annealing) for this architectural search, as sketched below? My hypothesis is that the search cost of finding the lucky network of 10% size (a combinatorially intractable problem) with these methods would not be much better than with SGD, since SGD affords a much richer training signal. A natural next question is whether we can leverage information from the loss landscape at initialization to identify whether a given initialization is a lottery ticket, in a way that is faster than SGD. To this end, and to understand the theoretical underpinnings of why lottery tickets are effective, we looked at the Double Descent line of work.
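For illustration, a purely hypothetical sketch of what such a genetic-algorithm mask search could look like; `fitness` is an assumed callback that scores a mask (e.g. by briefly training the masked network), and each call would be expensive, which is exactly why I doubt this beats SGD:

```python
import numpy as np

def ga_mask_search(fitness, n_weights, keep_frac=0.1,
                   pop_size=20, generations=50, seed=0):
    rng = np.random.default_rng(seed)
    k = int(keep_frac * n_weights)
    # Start from random boolean masks that each keep exactly k weights.
    pop = [rng.permutation(n_weights) < k for _ in range(pop_size)]
    for _ in range(generations):
        scores = np.array([fitness(m) for m in pop])  # each call is expensive
        parents = [pop[i] for i in scores.argsort()[-pop_size // 2:]]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.choice(len(parents), size=2, replace=False)
            child = np.where(rng.random(n_weights) < 0.5,
                             parents[a], parents[b])   # uniform crossover
            child ^= rng.random(n_weights) < 0.01      # mutation (sparsity only
            children.append(child)                     # approximately preserved)
        pop = parents + children
    return max(pop, key=fitness)
```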
My thoughts on the Double Descent paper are that some form of learning on the original task is essential: iterative pruning is nothing but a way of imbuing the model with a sparsity-based inductive bias, and so long as SGD remains the most effective way to learn that inductive bias, iterative pruning is inevitable. Hence, to properly pursue the line of work on efficiently identifying lottery tickets, one must design more effective heuristics that either narrow the search space for SGD (through some additional loss term) or shorten the learning process (through a stopping criterion akin to Early Bird).