In the new paper Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling, a research team from Peking University, ByteDance, and the University of Oxford presents Sparse Masked Modelling with Hierarchy (SparK), the first BERT-style pretraining approach that can be used on convolutional models without any backbone modifications.
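To make the idea concrete, here is a minimal sketch of BERT-style masked modelling on a convolutional encoder. It uses dense convolutions over zeroed-out patches as a stand-in for SparK's sparse convolutions and hierarchical decoding, and all layer sizes are illustrative assumptions rather than the paper's configuration.

```python
# Minimal sketch of masked image modelling with a convolutional encoder.
# Dense convolutions over zeroed patches stand in for true sparse convolutions.
import torch
import torch.nn as nn

patch = 8                      # mask granularity in pixels (assumed)
mask_ratio = 0.6               # fraction of patches hidden (assumed)

encoder = nn.Sequential(       # toy conv encoder, not the paper's backbone
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
)
decoder = nn.Sequential(       # lightweight decoder reconstructing pixels
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
)

imgs = torch.randn(4, 3, 64, 64)
grid = (64 // patch, 64 // patch)
keep = (torch.rand(4, 1, *grid) > mask_ratio).float()   # 1 = visible patch
mask = keep.repeat_interleave(patch, 2).repeat_interleave(patch, 3)

recon = decoder(encoder(imgs * mask))               # encode visible patches only
loss = ((recon - imgs) ** 2 * (1 - mask)).mean()    # reconstruct the masked patches
loss.backward()
```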
A research team from Carnegie Mellon University and Google systematically explores strategies for leveraging the relatively under-studied resource of bilingual lexicons to adapt pretrained multilingual models to low-resource languages. Their resulting Lexicon-based Adaptation approach produces consistent performance improvements without requiring additional monolingual text.
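A rough sketch of the core idea, lexicon-based data synthesis: words in high-resource text are swapped with their lexicon translations to create pseudo target-language data for adaptation. The lexicon entries and whitespace tokenisation below are hypothetical placeholders, not from the paper.

```python
# Minimal sketch of lexicon-based data synthesis, assuming a word-to-word
# bilingual lexicon and a whitespace tokenizer (both hypothetical here).
import random

lexicon = {                      # hypothetical English -> target-language entries
    "water": "amanzi",
    "good": "kuhle",
    "food": "ukudla",
}

def synthesize(sentence: str, swap_prob: float = 1.0) -> str:
    """Replace lexicon words with their translations to create
    pseudo target-language text for adaptation."""
    out = []
    for word in sentence.lower().split():
        if word in lexicon and random.random() < swap_prob:
            out.append(lexicon[word])
        else:
            out.append(word)
    return " ".join(out)

print(synthesize("good water and good food"))
# e.g. "kuhle amanzi and kuhle ukudla" -- synthetic text usable for
# continued pretraining or for projecting task labels word by word.
```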
In the new paper Token Dropping for Efficient BERT Pretraining, a research team from Google, New York University, and the University of Maryland proposes a simple but effective “token dropping” technique that significantly reduces the pretraining cost of transformer models such as BERT without hurting performance on downstream fine-tuning tasks.
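A minimal sketch of the token dropping idea: the full sequence passes through the lower and upper layers, while only the most "important" tokens are processed by the middle layers. The paper derives importance from each token's masked-language-modelling loss; random scores and a toy shared encoder layer stand in here.

```python
# Minimal sketch of dropping tokens in the middle layers of a transformer
# encoder, assuming per-token importance scores are already available.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
lower, middle, upper = layer, layer, layer   # shared toy layers for brevity

x = torch.randn(2, 128, 64)                  # (batch, seq_len, hidden)
importance = torch.rand(2, 128)              # stand-in for per-token MLM loss
keep = importance.topk(64, dim=1).indices.sort(dim=1).values  # keep the top half

h = lower(x)                                 # full sequence through lower layers
idx = keep.unsqueeze(-1).expand(-1, -1, 64)
kept = torch.gather(h, 1, idx)
kept = middle(kept)                          # only kept tokens in middle layers
h = h.scatter(1, idx, kept)                  # merge them back into the sequence
out = upper(h)                               # full sequence restored for the top
```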
An Intel research team presents Prune Once for All (Prune OFA), a training method that leverages weight pruning and model distillation to produce pretrained transformer-based language models with high sparsity ratios. Applied to BERT, the approach achieves state-of-the-art results in compression-to-accuracy ratio.
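A minimal sketch of the ingredients Prune OFA combines, magnitude weight pruning plus distillation from an unpruned teacher, using toy linear layers in place of BERT; the sparsity level and training loop are illustrative assumptions.

```python
# Minimal sketch of combining magnitude pruning with knowledge distillation.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Linear(128, 10)                  # stands in for the dense teacher model
student = nn.Linear(128, 10)                  # stands in for the model being pruned
teacher.requires_grad_(False)

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Keep the largest-magnitude weights, zeroing out the given fraction."""
    k = int(weight.numel() * sparsity)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

mask = magnitude_mask(student.weight.data, sparsity=0.9)
student.weight.data.mul_(mask)                # prune once ...
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for _ in range(10):                           # toy distillation loop
    x = torch.randn(32, 128)
    with torch.no_grad():
        soft_targets = F.softmax(teacher(x), dim=-1)
    loss = F.kl_div(F.log_softmax(student(x), dim=-1), soft_targets,
                    reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
    student.weight.data.mul_(mask)            # ... and keep the sparsity pattern fixed
```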
A research team from ByteDance, Johns Hopkins University, Shanghai Jiao Tong University and UC Santa Cruz seeks to apply the proven technique of masked language modelling to the training of better vision transformers, presenting iBOT (image BERT pretraining with Online Tokenizer), a self-supervised framework that performs masked prediction with an online tokenizer.
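A minimal sketch of masked prediction with an online tokenizer: an exponential-moving-average "teacher" produces soft targets for patches the "student" sees masked, and is itself updated from the student. The MLPs, temperatures, and zeroed-patch masking below are simplifying assumptions standing in for iBOT's vision transformers and mask tokens.

```python
# Minimal sketch of masked prediction against an online tokenizer: an EMA
# "teacher" provides soft targets for patches the "student" sees masked.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Linear(48, 256), nn.GELU(), nn.Linear(256, 1024))
teacher = copy.deepcopy(student).requires_grad_(False)      # online tokenizer

patches = torch.randn(8, 196, 48)              # (batch, patches, patch_dim)
mask = torch.rand(8, 196) < 0.4                # 40% of patches are masked

with torch.no_grad():
    targets = F.softmax(teacher(patches) / 0.04, dim=-1)    # teacher "tokens"

masked_in = patches.masked_fill(mask.unsqueeze(-1), 0.0)    # hide masked patches
logits = student(masked_in)
loss = -(targets * F.log_softmax(logits / 0.1, dim=-1)).sum(-1)[mask].mean()
loss.backward()

# EMA update keeps the tokenizer "online", improving together with the student.
with torch.no_grad():
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(0.996).add_(s, alpha=0.004)
```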
A research team from Huawei Noah’s Ark Lab and Tsinghua University proposes Extract Then Distill (ETD), a generic and flexible strategy for reusing teacher model parameters for efficient and effective task-agnostic distillation that can be applied to student models of any size.
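A minimal sketch of the extract-then-distill pattern: a subset of teacher parameters initialises a smaller student, which is then distilled against the teacher. Ranking neurons by weight norm and matching features through a projection are brevity-driven assumptions here, not necessarily ETD's exact criteria.

```python
# Minimal sketch: pick the most "important" teacher neurons to initialise a
# narrower student layer, then distil the student against teacher features.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Linear(768, 768)
student = nn.Linear(768, 256)                 # narrower student layer

# --- extract: copy the 256 teacher rows with the largest L1 norm ---
importance = teacher.weight.abs().sum(dim=1)
top = importance.topk(256).indices
with torch.no_grad():
    student.weight.copy_(teacher.weight[top])
    student.bias.copy_(teacher.bias[top])

# --- distill: match student features to the teacher's through a projection ---
proj = nn.Linear(256, 768)
opt = torch.optim.Adam(list(student.parameters()) + list(proj.parameters()), lr=1e-3)
x = torch.randn(32, 768)
loss = F.mse_loss(proj(student(x)), teacher(x).detach())
opt.zero_grad(); loss.backward(); opt.step()
```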
A research team from Google Research proposes small, fast, on-device disfluency detection models based on the BERT architecture. The smallest model size is only 1.3 MiB, representing a size reduction of two orders of magnitude and an inference latency reduction of a factor of eight compared to state-of-the-art BERT-based models.
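For orientation, a minimal sketch of disfluency detection framed as token tagging with a small transformer encoder; the dimensions are illustrative and not the paper's 1.3 MiB configuration.

```python
# Minimal sketch of disfluency detection as token tagging: a tiny transformer
# encoder labels each token as fluent (0) or disfluent (1).
import torch
import torch.nn as nn

vocab, hidden, seq_len = 8000, 64, 32
embed = nn.Embedding(vocab, hidden)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden, nhead=2, dim_feedforward=128,
                               batch_first=True),
    num_layers=2,
)
tagger = nn.Linear(hidden, 2)                    # per-token fluent / disfluent logits

tokens = torch.randint(0, vocab, (1, seq_len))   # toy ids for "a flight [to Boston] uh to Denver"
logits = tagger(encoder(embed(tokens)))          # (1, seq_len, 2)
disfluent = logits.argmax(-1)                    # 1 marks tokens to be removed
```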
UmlsBERT is a deep Transformer architecture that incorporates clinical domain knowledge from the Unified Medical Language System (UMLS) Metathesaurus to build ‘semantically enriched’ contextual representations that benefit from both contextual learning and domain knowledge.
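A minimal sketch of the enrichment idea: each input token's embedding is augmented with an embedding of its associated UMLS semantic type. The ID values and table sizes below are hypothetical.

```python
# Minimal sketch of enriching BERT-style input embeddings with clinical
# knowledge: word embeddings are summed with embeddings of each token's
# UMLS semantic type (the lookup and sizes here are hypothetical).
import torch
import torch.nn as nn

vocab, n_semantic_types, hidden = 30522, 44, 128
word_emb = nn.Embedding(vocab, hidden)
sem_emb = nn.Embedding(n_semantic_types + 1, hidden, padding_idx=0)  # 0 = no concept

token_ids = torch.tensor([[101, 5123, 2019, 102]])   # toy token ids
sem_ids = torch.tensor([[0, 12, 0, 0]])              # e.g. 12 = a disease-related type

# Semantically enriched input: contextual word embedding + domain knowledge.
inputs = word_emb(token_ids) + sem_emb(sem_ids)
```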
In what the company calls “the biggest leap forward in the past five years, and one of the biggest leaps forward in the history of Search,” Google today announced that it has leveraged its pretrained language model BERT to dramatically improve the understanding of search queries.
Since Google Research introduced BERT (Bidirectional Encoder Representations from Transformers) in 2018, the model has gained unprecedented popularity among researchers. Now, a group of researchers from National Cheng Kung University in Tainan, Taiwan, is challenging BERT’s efficacy.