Optimizing the channel counts for different layers of a convolutional neural network (CNN) is a proven way to improve efficiency and boost model test performance. The method has a drawback, however: it adds a large computational overhead.
To reduce this overhead and shorten training time, a research team from Carnegie Mellon University, the University of Texas at Austin and Facebook AI has proposed width transfer, a technique that “harnesses the assumptions that the optimized widths (or channel counts) are regular across sizes and depths.” The novel approach optimizes widths for each CNN layer, is compatible with various width optimization algorithms and networks, and can achieve up to a 320x reduction in width optimization overhead without compromising top-1 accuracy on ImageNet.
The researchers summarize their contributions as:
- Propose width transfer, a novel paradigm for efficient width optimization, along with two novel layer-stacking methods to transfer width across networks with different layer counts.
- Find that the optimized widths are highly transferable across networks’ initial width and depth, and across datasets’ sample size and resolution.
- Demonstrate a practical implication of the previous finding by showing that one can achieve a 320x reduction in width optimization overhead for a scaled-up MobileNetV2 and ResNet18 on ImageNet with similar accuracy improvements, effectively making the cost of width optimization negligible relative to initial model training.
- With controlled hyperparameters over multiple random seeds on a large-scale image classification dataset, verify the effectiveness of width optimization methods proposed in prior art.
The design of layer-by-layer widths in a deep CNN can be considered a hyperparameter whose optimization improves classification accuracy without requiring additional floating-point operations (FLOPs). This is a nontrivial task, as obtaining a better design often requires intuition and domain expertise together with trial-and-error. Previous work has attempted to alleviate this labour-intensive trial-and-error process by using neural architecture search algorithms to identify layer-wise channel counts that maximize validation accuracy subject to test-time FLOPs constraints.
This approach, however, adds a large computational overhead for the width optimization procedure. Moreover, width optimization algorithms are often parameterized by target test-time resource constraints, which makes the process prohibitively time-consuming when optimizing CNNs for embodied AI applications.
The researchers describe their width transfer paradigm as a first step toward understanding the transferability (invariance) of optimized widths across different width optimization algorithms and networks, while reducing both the computational cost of width optimization and the dimensions of the optimization variables.
The proposed width transfer method first projects an original network and dataset to smaller counterparts. For network projection, the researchers employ a width multiplier to uniformly shrink channel counts to get a narrower model and a depth multiplier to uniformly shrink the block counts to get a shallower model. For dataset projection, they propose sub-sampling the training samples and the spatial dimensions of the training images to obtain lower-resolution images.
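The projection step can be illustrated with a minimal sketch. The function names, the rounding rules and the one-channel and one-block floors below are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def project_widths(widths, width_mult):
    """Uniformly shrink per-layer channel counts (assumed floor of 1 channel)."""
    return [max(1, round(c * width_mult)) for c in widths]


def project_depth(blocks_per_stage, depth_mult):
    """Uniformly shrink per-stage block counts (assumed floor of 1 block)."""
    return [max(1, math.ceil(b * depth_mult)) for b in blocks_per_stage]


def project_dataset(num_samples, resolution, sample_frac, res_mult, seed=0):
    """Sub-sample training indices and shrink the image side length."""
    keep = max(1, int(num_samples * sample_frac))
    indices = random.Random(seed).sample(range(num_samples), keep)
    return sorted(indices), max(1, int(resolution * res_mult))


# Hypothetical network: three layers and three stages.
narrow = project_widths([16, 24, 32], 0.5)
shallow = project_depth([2, 4, 6], 0.5)
subset, small_res = project_dataset(1000, 224, 0.5, 0.5)
```

Width optimization would then run on the narrower, shallower network with the smaller dataset, which is where the savings in overhead come from.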
The team then extrapolates the optimized results. For extrapolation, they consider two aspects: dimension-matching and FLOPs-matching. To match the layer counts of the extrapolated network with those of the original network, they propose two layer-stacking strategies: stack-last-block and stack-average-block. The stack-last-block strategy stacks the width multipliers of the last block of each stage until the desired depth is met, while the stack-average-block strategy stacks the average width multipliers until the desired depth is met.
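The two layer-stacking strategies can be sketched for a single stage as follows. This is a simplified reading that assumes one width multiplier per block and assumes the optimized blocks are kept while extra blocks are appended. The paper's exact procedure may differ, and the function names are hypothetical.

```python
def stack_last_block(stage_mults, target_blocks):
    """Repeat the last block's width multiplier until the stage is deep enough."""
    extra = max(0, target_blocks - len(stage_mults))
    return list(stage_mults) + [stage_mults[-1]] * extra


def stack_average_block(stage_mults, target_blocks):
    """Append the stage-average width multiplier until the stage is deep enough."""
    extra = max(0, target_blocks - len(stage_mults))
    average = sum(stage_mults) / len(stage_mults)
    return list(stage_mults) + [average] * extra


# Hypothetical stage optimized at 2 blocks, extrapolated back to 4 blocks.
optimized_stage = [0.5, 1.0]
last = stack_last_block(optimized_stage, 4)
avg = stack_average_block(optimized_stage, 4)
```

Here `last` repeats the final multiplier, while `avg` fills the extra blocks with the mean of the optimized multipliers.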
Finally, the team uses the width multiplier to widen the optimized width and match the FLOPs to the original network, completing the width transfer procedure.
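FLOPs-matching can be sketched as a search for a single uniform width multiplier. The cost model below (a plain chain of 3x3 convolutions at a fixed spatial size) and the bisection search are assumptions chosen for illustration, not the paper's implementation.

```python
def conv_chain_flops(widths, in_channels=3, kernel=3, spatial=32):
    """Multiply-accumulate count for a plain chain of conv layers (toy cost model)."""
    flops, c_in = 0, in_channels
    for c_out in widths:
        flops += kernel * kernel * c_in * c_out * spatial * spatial
        c_in = c_out
    return flops


def scale(widths, mult):
    """Apply a uniform width multiplier, keeping at least one channel per layer."""
    return [max(1, round(c * mult)) for c in widths]


def match_flops(widths, target_flops, lo=0.1, hi=10.0, iters=50):
    """Bisect for the largest uniform multiplier whose FLOPs stay within target.

    Assumes scale(widths, lo) is already within the target budget.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if conv_chain_flops(scale(widths, mid)) <= target_flops:
            lo = mid
        else:
            hi = mid
    return lo, scale(widths, lo)


# Hypothetical example: widen optimized widths back to the original FLOPs budget.
original = [16, 32, 64]
optimized = [6, 20, 24]
budget = conv_chain_flops(original)
mult, matched = match_flops(optimized, budget)
```

The matched network keeps the relative layer widths found by optimization while spending roughly the same FLOPs as the original network.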
To validate the transferability of the optimized widths across different projections and extrapolation strategies, the researchers conducted extensive experiments on the ImageNet dataset with networks such as MobileNetV2 and width optimization algorithms such as AutoSlim and MorphNet.
The width projection experiment showed that optimized widths transfer well in the MobileNetV2, AutoSlim and MorphNet settings, reducing width optimization overhead by up to 80 percent. For depth projection, the optimized widths stay competitive via simple layer-stacking methods and save up to 75 percent of width optimization overhead.
Overall, the proposed approach saved up to 95 percent of width optimization overhead while outperforming the uniform baseline, and it matched the performance of direct optimization while saving 90 percent of width optimization overhead. The team also tested width transfer performance with compound projection, where it achieved up to a 320x reduction in width optimization overhead.
The study shows the proposed width transfer approach operates well across different projections (width, depth and datasets) and extrapolation strategies, and can reduce not only the computational costs required for width optimization, but also the dimensions of the optimization variables.
The paper Width Transfer: On the (In)Variance of Width Optimization is on arXiv.
Author: Hecate He | Editor: Michael Sarazen