Evolutionary Methods in Compression

Evolutionary algorithms are heruistic-based optimization algorithms inspired by biological evolution (Darwin, 1859). In a broad scope, it involves crafting your decision variables as chromosomes that change through evolution, to reach the optimal value. Evidently, form of optimization does not require the use of gradients, making it powerful when obtaining gradients is too costly or even impossible. Today, we will look at some papers in genetic and memetic evolutionary algorithms, and ponder how they might be applicable to model compression.

Towards Evolutionary Compression (KDD, 2018)

This paper casts the pruning problem as a binary programming problem, generating a pruning mask \(m \in {0,1}\) denoting if a weight in an orignal network should be pruned, while minimizing a combination of model size and compression error. As this problem is intractable (NP-Hard), there have been heuristics-based estimations (e.g. L1 norm), and gradient-based methods using estimated gradients of the change in loss (e.g. Straight-through estimator). This paper looks at using stochastic optimisation methods such as evolutionary computating, i.e. genetic algorithms, as the solver.

The problem formulation is as such: the fitness function is sum of the L1 norm of the pruning mask (i.e. counting the number of activated parameters of filters in a CNN), and the classification loss of tthe network with the pruning mask applied. The optimization algorithm is as follows: An initial population is first randomly initialized, with the chromosomes being its pruning mask. During each optimization step, each candidate solution is assigned a probability equal to its fitness score normalized over the entire population. One of three operations is applied to the population, namlely selection, crossover and mutation, iterated for as many units in the population.

In selection, a parent is randomly selected from the population (based on its assigned probability), and directly used as the offspring. This step makes successful solutions more likely to stay in the population. In crossover, two parents are selected according to their assigned probabilities, and a randomly chosen segment of its chromosome will cross-over between the parents to generate two offsprings (we select the better of two in the next population). This makes it such that successful parents can share subsets of their chromosomes to generate better children. In mutation, a parent is selected according to its assigned probability, and a randomly selected segment of its chromosome will be changed by multiplying the binary input with an XOR operation. The offspring is then the mutated parent. This operation allows the population to improve by random, and not be constrained by the current population’s gene pool.

After a maximum number of iterations, we obtain a population with improved fitness (i.e. higher compression with lower performance loss). We then choose the top performant solution for fine-tuning, to obtain the compressed model.

Multifactorial Evolution: Towards Evolutionary Multitasking (IEEE Transactions on Evolutionary Computation, 2016)

This paper introduces Multifactorial Optimisation (MFO), which differs from traditional Single- and Multi-Objective Optimisation (SOO and MOO respectively). SOO optimizes for a scalar objective value, while MOO optimizes for a vector-valued fitness function. In both SOO and MOO, the search spaces are unique. MFO is similar to MOO in that they both optimize for multiple objectives (i.e. tasks). Distinct from SOO and MOO, MFO considers multiple search spaces that may be interdependent. Another key difference between MFO and MOO is the latter considers all pareto optimal solutions to be ideal, whereas the former considers solutions with the best fitness wrt each objective (i.e. skill factor) as ideal. This trade-off might be ideal in the case where we only want solutions that excel on a small sub-set of tasks, allowing us to hone in on performant sub-networks for each use case; this allows us to mitigate the storing all pareto-optimal solutions, which for a high k (i.e. high number of tasks), quickly becomes intractable.

Multifactorial Evolutionary Algorithms (MFEA) is an evolutionary algorithm that allows the propogation of genotypes to come not only from selection/modification, but also cultural traits (a la memetic computation), used to solve MFO problems. As such, the features that affect the population genotypes are assortative mating and vertical cultural transmission.

Assortative mating is baises the mating of parents from the same cultural background, which is the skill factor (the task in which the solution is ranked the highest). If skill factors of selected parents for crossing over differ, either crossing over or mutation occurs, with the former occuring at a pre-set random mating probability. High biasing towards similar cultures (i.e. low random mutation probability) allows us to find solutions that are highly specialized quickly, but may converge to local minima.

Vertical cultural transmission is used to reduce the number of tasks evaluated for each candidate solution from k to 1, by only evaluating on the skill factor of either its only parent or either of its parent (50% chance). This makes it computationally more efficient, which makes sense considering the aforementioned assortative mating baises solutions to be specialists when the rmp is low.

The chromosomes are encoded as floats in the range of 0 to 1, and is mapped to a unified search space distinct from the constituent tasks. In this search space, each task is assigned a dimentionality one higher than the previous one, allowing us to operate in a search space of dimensionality equal to the number of tasks we are optimizing for. This allows us to bypass the curse of dimensionality in the case of naively concatenating decision variables of all tasks into a giant chromosome. One thing I am noting is how the ordering of tasks in assignment of dimensionality affects performance, where it seems like primary/rudimentary tasks should be assigned lower dimensions (where the interpretation of dimension assignment is like asssigning basis vectors in the optimization search space). To allow the encoding of constraint problems, such as binary or integer constraints, we allow the algorithm to search in the continuous space, but assign a high cost penalty for constraint violations. Algorithms such as the single linkage algorithm then identify clusters on the continuous space, i.e. finding two clusters in the binary case.

Multifactorial Evolutionary Algorithm with Online Transfer Parameter Estimation: MFEA-II (IEEE Transactions on Evolutionary Computation, 2020)

MFEA-II looks at the online learning setting of the MFEA problem space, i.e. multifactorial, multitask optimization. Specifically, a transfer parameter matrix is used during crossover to determine the extent of modification genes based on the similarity of tasks. This paper adds more theoretical grounding to the MFEA work, proving asymptoptic convergence to global optima when using MFEA as the candidate population tends to infinity.

To tackle the problem of negative interactions between tasks (e.g. unrelated tasks having different preferred solutions), this paper analyses the effect of rmp on the rate of convergence. Specifically, it was shown that an rmp of 0 (i.e. no knowledge transfer between dissimilar cultures) is faster to converge in the single task setting, making it clear that the rmp is a tradeoff that needs to be tuned.

Hence, to learn the rmp value that balances between convergence speed and performance, pair-wise rmp of each task is used in the form of a transfer parameter matrix. Learning this matrix is trivial, given the convex nature of the loss (KL divergence between the estimate and true distribution of offsprings), and that the transfer parameter matrix can be used to model the offspring distribution given the parents distribution.

Multi-Objective Multifactorial Optimization in Evolutionary Multitasking (IEEE Transactions on Cybernetics, 2016)

MO-MFEA is an evolutionary algorithm applied to the setting of solving multiple Multi-Objective Optimization problems simultaneously using the same evolving population. This is useful in the case when we are optimizing for different objectives, e.g. model size and model performance, while optimizing in a multitask setting (e.g. performance on different datasets, different tasks). In selecting offsprings, binary tournament selection is employed on the parent candidates based on the absolute value of the contenders’ fitness value (regardless of skill factors). Cross-over happens based on the rmp value, and if the parents’ skill factors are similar; otherwise, random mutation occurs to each parent in generating the offsprings.

Cognizant Multitasking in Multi-Objective Multifactorial Evolution: MO-MFEA-II (IEEE Transactions on Evolutionary Computation, 2020)

MO-MFEA-II is similar to MFEA-II in that they both utilize a transfer parameter matrix in an online setting to learn the degree of pair-wise inter-task transfer between generations. It is similar to MO-MFEA in that they both examine MFEA in a multi-objective setting.

Conclusion

In this post, we explored how evolutionary algorithms can be applied to model compression. We then looked at multi-objective, multifactorial optimization problems.

Bill Pung