Deep learning-based generative models have become increasingly popular in recent years due to their ability to produce highly realistic images, text and audio. Although deep autoregressive models have been researchers’ go-to choice for fitting model parameters via maximum likelihood estimation (MLE), variational autoencoders (VAEs), and especially the recently proposed hierarchical VAEs (HVAEs), have demonstrated their potential to be competitive in MLE performance. The HVAE’s main limitations in this regard are its high compute cost and its instability during training.
In the new paper Efficient-VDVAE: Less is More, researchers from Cash App Labs introduce simple modifications to the Very Deep VAE (VDVAE) that speed up convergence by up to 2.6x, save up to 20x in memory, and improve stability during training. Their modified VDVAE achieves comparable or better performance than current state-of-the-art models on seven commonly used image datasets.
The paper cites VDVAE studies showing that such networks generally benefit from more layers at higher resolutions, indicating that it is beneficial to have latent variables that learn local image details at high-resolution layers. While it is thus natural to add more layers at high resolutions to improve performance, doing so also greatly increases memory requirements and yields diminishing returns.
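To give a rough sense of why high-resolution layers are expensive, the following back-of-the-envelope sketch computes the memory needed to store a single layer's output at different resolutions. The channel width, resolutions and float32 storage are illustrative assumptions, not values from the paper, and `activation_bytes` is a hypothetical helper.

```python
def activation_bytes(height, width, channels, batch=1, bytes_per_value=4):
    """Bytes needed to store one activation tensor (float32 by default)."""
    return batch * height * width * channels * bytes_per_value


# Hypothetical width and resolutions, chosen only for illustration.
channels = 256
for res in (8, 32, 128):
    mib = activation_bytes(res, res, channels) / 2**20
    print(f"{res}x{res}: {mib:.2f} MiB per layer output")
```

Because activation memory grows with the square of the spatial resolution, a layer at 128 × 128 stores 256 times as many values as a layer of the same width at 8 × 8.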
The researchers propose that VDVAE’s stability and computational efficiency can be improved via architectural design choices that also retain or boost MLE performance. To this end, they make the following minor modifications to VDVAE, which the paper describes as follows:
- Bottom-up block: We added a skip connection before propagating the output towards the top-down block. This enables us to project the activations χ to any arbitrary width when passing them to the posterior computation branch (in the top-down block), even if the number of filters in the rest of the model is changing.
- Pool layer: VDVAE uses a non-trainable average pooling to downsample activations. We replace that with a 1 × 1 convolution to have the freedom to change the number of filters.
- Unpool layer: We add a 1 × 1 convolution prior to the nearest neighbour upsampling to also have the freedom to change the number of filters inside the top-down model.
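The pool and unpool changes can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the tensor shapes and the use of a stride equal to the downsampling factor in the pooling convolution are assumptions made here for illustration, and the bottom-up skip connection is not covered.

```python
import numpy as np


def conv1x1(x, weight, stride=1):
    """1x1 convolution on a (C_in, H, W) array; weight has shape (C_out, C_in).

    A 1x1 convolution is a per-pixel linear map over channels, so striding
    amounts to subsampling pixels before applying that map.
    """
    x = x[:, ::stride, ::stride]
    return np.einsum("oc,chw->ohw", weight, x)


def avg_pool(x, factor):
    """Non-trainable average pooling: the channel count cannot change."""
    c, h, w = x.shape
    return x.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))


def pool_1x1(x, weight, factor):
    """Downsample with a strided 1x1 convolution (the stride is an assumption)."""
    return conv1x1(x, weight, stride=factor)


def unpool_1x1(x, weight, factor):
    """1x1 convolution followed by nearest-neighbour upsampling."""
    y = conv1x1(x, weight)
    return y.repeat(factor, axis=1).repeat(factor, axis=2)


rng = np.random.default_rng(0)
x = rng.standard_normal((64, 16, 16))    # hypothetical 64-channel activations
w_down = rng.standard_normal((128, 64))  # stands in for trainable weights
w_up = rng.standard_normal((32, 128))

pooled_avg = avg_pool(x, 2)             # shape (64, 8, 8): width is fixed
pooled = pool_1x1(x, w_down, 2)         # shape (128, 8, 8): width can change
unpooled = unpool_1x1(pooled, w_up, 2)  # shape (32, 16, 16)
```

The point of the sketch is only the shapes: average pooling must keep the input's channel count, while the 1 × 1 convolutions let each resolution use its own width.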
For their empirical experiments, the team explored the effect of changing the optimization scheme to converge faster, training all models with reduced batch sizes to save computational cost. Memory usage served as the cost measure and negative log-likelihood (NLL) as the performance measure.
The experiments were performed using popular datasets such as CIFAR-10, ImageNet, MNIST, CelebA and FFHQ, and the proposed model was compared with state-of-the-art likelihood-based generative models such as PixelVAE++ and Image Transformer.
The results show that the modified VDVAE converges up to 2.6x faster than a conventional VDVAE, reduces memory load by up to 20x, and improves stability during training. The modified model also achieved comparable or better NLL performance compared to current state-of-the-art models on all seven image datasets used in the evaluations.
The researchers also note that about three percent of the hierarchical VAE’s latent space dimensions are sufficient to encode most image information without any performance loss, indicating the potential to efficiently leverage the hierarchical VAE’s latent space in downstream tasks.
The code is available on the project’s GitHub. The paper Efficient-VDVAE: Less is More is on arXiv.
Author: Hecate He | Editor: Michael Sarazen