In the new paper Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling, a research team from Peking University, ByteDance, and the University of Oxford presents Sparse Masked Modeling with Hierarchy (SparK), the first BERT-style pretraining approach that can be applied to convolutional models without any backbone modifications.
Colossal-AI releases a complete open-source Stable Diffusion pretraining and fine-tuning solution that reduces the pretraining cost by 6.5 times, and the hardware cost of fine-tuning by 7 times, while simultaneously speeding up the processes! The fine-tuning task flow can also be conveniently completed on an RTX 2070/3050 PC.
In the new paper Unified Pretraining Framework for Document Understanding, an Adobe Research and Adobe Document Cloud team presents UDoc, a unified pretraining framework for document understanding that enables cross-modal connections and highlights relevant information in both visual and textual modalities. UDoc achieves impressive performance on various downstream tasks.
A team from Google Research, the University of Pennsylvania, and Cornell University proposes a principled perspective for filtering out common memorization in LMs, introducing “counterfactual memorization” to measure the expected change in a model’s prediction when a given example is removed from training, distinguishing “rare” (episodic) memorization from “common” (semantic) memorization in neural LMs.
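The idea can be sketched as the gap in expected per-example performance between models trained on data subsets that contain the example and subsets that do not. Below is a minimal, illustrative Python sketch (the function name, array layout, and score convention are assumptions, not the paper's code):

```python
def counterfactual_memorization(perf, in_train):
    """Estimate counterfactual memorization per example.

    perf[m][j]     : score of model m on example j (e.g. per-example accuracy),
                     from many models each trained on a random data subset.
    in_train[m][j] : True if example j was in model m's training subset.

    Returns, for each example j, the mean score of models that saw j
    minus the mean score of models that did not see j.
    """
    n_models = len(perf)
    n_examples = len(perf[0])
    mem = []
    for j in range(n_examples):
        with_x = [perf[m][j] for m in range(n_models) if in_train[m][j]]
        without_x = [perf[m][j] for m in range(n_models) if not in_train[m][j]]
        mem.append(sum(with_x) / len(with_x) - sum(without_x) / len(without_x))
    return mem
```

A highly memorized example is one the model predicts well only when it was in the training set, so its score approaches 1.0; an example captured by general (semantic) patterns scores near 0.0 because held-out models predict it equally well.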
Baidu researchers propose ERNIE-ViLG, a 10-billion-parameter pretraining framework for bidirectional text-image generation. Pretrained on 145 million (Chinese) image-text pairs, ERNIE-ViLG achieves state-of-the-art performance on both text-to-image and image-to-text generation tasks.
In the paper Florence: A New Foundation Model for Computer Vision, a Microsoft research team proposes Florence, a novel foundation model for computer vision that significantly outperforms previous large-scale pretraining approaches and achieves new SOTA results across a wide range of visual and visual-linguistic benchmarks.
A Google Research team conducts a systematic exploration comprising more than 4800 experiments on Vision Transformers, MLP-Mixers and ResNets with parameters ranging from 10 million to 10 billion, evaluated on more than 20 downstream image recognition tasks, aiming to capture the nonlinear relationships between performance on upstream and downstream tasks.
In a 200+ page paper, Percy Liang, Fei-Fei Li, and over 100 other researchers from the Stanford University Center for Research on Foundation Models (CRFM) systematically describe the opportunities and risks of large-scale pretrained “foundation” models. The study aims to provide a clearer understanding of how these models work, when and how they fail, and the various capabilities provided by their emergent properties.
A Google Research team proposes MergeDistill, a framework for merging multiple pretrained monolingual/multilingual teacher LMs into a single multilingual, task-agnostic student LM that leverages the capabilities of powerful language-specific LMs while remaining multilingual and enabling positive language transfer.
A research team from Facebook shows how transfer learning enables pretraining on non-IDE, non-autocompletion, and different-language example code sequences before fine-tuning on the autocompletion prediction task, improving model accuracy by over 50 percent on very small fine-tuning datasets and by over 10 percent with 50k labelled examples.
A research team from Huawei Noah’s Ark Lab and Tsinghua University proposes Extract Then Distill (ETD), a generic and flexible strategy for reusing teacher model parameters for efficient and effective task-agnostic distillation that can be applied to student models of any size.