A research team from the University of California San Diego and Microsoft has proposed a novel approach that improves model accuracy on computer vision tasks at extremely low computational cost, achieving significant performance gains over state-of-the-art models.
Today’s increasingly efficient CNN architectures such as MobileNet and ShuffleNet have dramatically cut computational costs. On tasks such as ImageNet classification, the required compute has plummeted by two orders of magnitude — from 3.8G FLOPs to about 40M FLOPs — with acceptable performance trade-offs. But even the best models struggle in the extremely low FLOP regime (21M to 4M MAdds).
The researchers address this deficiency in the new paper MicroNet: Improving Image Recognition with Extremely Low FLOPs.
The study examines the extremely low FLOP regime from two perspectives, node connectivity and non-linearity, which relate to network width and depth respectively. To improve accuracy, the researchers propose sparse connectivity and a dynamic activation function: the former avoids a significant reduction in network width, while the latter compensates for the reduction in network depth.
The proposed Micro-Factorized Convolution (MF-Conv) method balances the number of channels against node connectivity: it factorizes a convolution matrix into low-rank matrices, thereby integrating sparse connectivity into the convolution.
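To make the trade-off concrete, the following minimal sketch counts multiply-adds for a dense pointwise (1×1) convolution and for a two-stage grouped factorization of it. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the rule that the number of groups is the square root of the channel count divided by the reduction ratio is taken from our reading of the paper rather than from this article.

```python
import math


def dense_pointwise_madds(channels: int) -> int:
    """Multiply-adds per spatial position for a dense 1x1 convolution (C -> C)."""
    return channels * channels


def adaptive_groups(channels: int, reduction: int) -> int:
    """Assumed group rule G = sqrt(C / R); hypothetical helper for illustration."""
    return math.isqrt(channels // reduction)


def factorized_pointwise_madds(channels: int, reduction: int, groups: int) -> int:
    """Multiply-adds per position for two grouped 1x1 convolutions: C -> C/R -> C."""
    hidden = channels // reduction
    # A grouped 1x1 convolution with G groups costs (inputs * outputs) / G.
    squeeze = channels * hidden // groups
    expand = hidden * channels // groups
    return squeeze + expand


channels, reduction = 64, 4
groups = adaptive_groups(channels, reduction)
dense_cost = dense_pointwise_madds(channels)
factorized_cost = factorized_pointwise_madds(channels, reduction, groups)
print(groups, dense_cost, factorized_cost)
```

With 64 channels and a reduction ratio of 4, the sketch uses 4 groups and the factorized form needs one eighth of the dense multiply-adds. The savings can then be spent on keeping more channels, which is the width-preserving effect the article describes.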
Micro-Factorized pointwise and depthwise convolutions can be combined in two ways: a regular combination, which concatenates the two convolutions, or a lite combination, which uses Micro-Factorized depthwise convolutions to expand the number of channels and then applies one group-adaptive convolution to fuse and squeeze them.
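The article does not spell out how the depthwise half is factorized. One standard low-rank choice, assumed here for illustration, is to split a k×k kernel into a k×1 kernel followed by a 1×k kernel. The sketch below checks on a small integer image that the two passes reproduce the result of the full rank-1 kernel while using 2k rather than k² multiplies per output. All names are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Plain 'valid' 2D cross-correlation on nested lists (single channel)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def outer(col, row):
    """Rank-1 k x k kernel built from a k x 1 column and a 1 x k row."""
    return [[a * b for b in row] for a in col]


col = [1, 2, 1]    # k x 1 factor (illustrative values)
row = [1, 0, -1]   # 1 x k factor (illustrative values)
image = [[(r * 5 + c * 3) % 7 for c in range(5)] for r in range(5)]

# One pass with the full 3x3 kernel: k * k = 9 multiplies per output value.
direct = conv2d_valid(image, outer(col, row))

# Two passes with the factors: 2 * k = 6 multiplies per output value.
vertical = conv2d_valid(image, [[v] for v in col])
two_pass = conv2d_valid(vertical, [row])

print(direct == two_pass)
```

The equivalence only holds for kernels that are exactly rank 1, so a network built this way learns the two factors directly instead of approximating an existing k×k kernel.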
The team also introduces dynamic Shift-Max (DY-ShiftMax), a dynamic activation function that improves non-linearity while strengthening connections between the groups created by micro-factorization.
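As a rough illustration of the idea, the sketch below implements a simplified Shift-Max for a single spatial position: each output channel is the maximum over several weighted sums of the channel itself and copies shifted by a whole group. The formulation follows our reading of the paper and should be treated as an assumption. In particular, the coefficients here are fixed, whereas in the described method they are generated dynamically from the input, and the function name `shift_max` is hypothetical.

```python
def shift_max(x, coeffs, groups):
    """Simplified Shift-Max over one position.

    x:       list of C channel values.
    coeffs:  coeffs[k][i][j] is the weight of the j-th group-shifted copy
             in the k-th candidate for output channel i (fixed here; the
             real activation would compute these from the input).
    groups:  number of channel groups G; each shift moves by C / G channels.
    """
    num_channels = len(x)
    shift = num_channels // groups
    out = []
    for i in range(num_channels):
        candidates = []
        for per_channel in coeffs:
            weights = per_channel[i]
            candidates.append(
                sum(w * x[(i + j * shift) % num_channels]
                    for j, w in enumerate(weights))
            )
        out.append(max(candidates))  # max over the K candidates
    return out


x = [1.0, -2.0, 3.0, -4.0]
# K = 2 candidates, J = 2 terms: the channel itself, or its counterpart
# in the other group.
mix = [[[1, 0]] * 4, [[0, 1]] * 4]
mixed = shift_max(x, mix, groups=2)

# With one term and candidates (x, 0), the same function reduces to ReLU.
relu_like = shift_max(x, [[[1]] * 4, [[0]] * 4], groups=2)
print(mixed, relu_like)
```

In the first call every output channel sees a value from the other group, which is how the activation can pass information across groups that the factorized convolutions keep separate.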
Based on their Micro-Factorized convolution and dynamic Shift-Max, the team designed MicroNet models built from three types of Micro-Block. Micro-Block-A expands the number of channels via Micro-Factorized depthwise convolution and compresses them with a group-adaptive convolution. Micro-Block-B uses a full Micro-Factorized pointwise convolution to compress and expand the number of channels. Micro-Block-C implements a regular combination of Micro-Factorized depthwise and pointwise convolutions.
The team created four models with varying computational costs (4M, 6M, 12M and 21M MAdds) and evaluated them on three tasks: image classification, object detection, and keypoint detection for human pose estimation.
In the evaluations, the 12M and 21M FLOP MicroNet models outperformed MobileNetV3 by 9.6 percent and 4.5 percent respectively in terms of top-1 accuracy on the ImageNet classification task. On the object detection task, MicroNet-M3 achieved higher mAP (mean average precision) than MobileNetV3-Small ×1.0 with significantly lower backbone FLOPs (21M vs 56M). On the keypoint detection task, MicroNet-M3 outperformed the baseline while consuming only 22 percent (163.2M/726.9M) of the FLOPs.
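The compute ratios quoted above can be checked with a few lines of arithmetic. This is only a sanity check on the reported figures, and the helper name `flops_share` is hypothetical.

```python
def flops_share(model_flops: float, baseline_flops: float) -> float:
    """Fraction of the baseline's FLOPs that the model consumes."""
    return model_flops / baseline_flops


# Keypoint detection: 163.2M FLOPs for MicroNet-M3 vs 726.9M for the baseline.
keypoint_share = flops_share(163.2e6, 726.9e6)

# Object detection backbone: 21M FLOPs vs 56M for MobileNetV3-Small x1.0.
detection_share = flops_share(21e6, 56e6)

print(f"keypoint: {keypoint_share:.1%}, detection backbone: {detection_share:.1%}")
```

The keypoint figure rounds to the 22 percent reported, and the detection backbone uses three eighths of the MobileNetV3-Small ×1.0 backbone's FLOPs.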
Overall, the MicroNet model family achieved solid improvements across all three tasks, demonstrating the proposed approach’s effectiveness under extremely low FLOP conditions.
The paper MicroNet: Improving Image Recognition with Extremely Low FLOPs is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang