In the new paper YOLOv7: Trainable Bag-Of-Freebies Sets New State-Of-The-Art for Real-Time Object Detectors, an Academia Sinica research team releases YOLOv7. This latest YOLO version introduces novel “extend” and “compound scaling” methods that effectively utilize parameters and computation; and surpasses all known real-time object detectors in speed and accuracy.
In the new paper Is a Modular Architecture Enough?, a research team from Mila and the Université de Montréal conducts a rigorous and thorough quantitative assessment of common modular architectures that reveals the benefits of modularity and sparsity for deep neural networks and the sub-optimality of existing end-to-end learned modular systems.
In the new paper Extreme Compression for Pre-trained Transformers Made Simple and Efficient, a Microsoft research team introduces XTC, a simple yet effective extreme compression pipeline for pretrained transformers that can achieve state-of-the-art results while reducing model size by 50x.
In the new paper Rare Gems: Finding Lottery Tickets at Initialization, a research team from Carnegie Mellon University, MBZUAI, Petuum, Inc and the University of Wisconsin-Madison proposes GEM-MINER, an algorithm that finds sparse subnetworks at initialization trainable to accuracy that is comparable or better than iterative magnitude pruning (IMP) with warm-up.
The BAAI Conference 2022 kicked off on May 31 in Beijing and ran through June 2. AI experts, industry leaders, young talents and international delegates joined the virtual gathering and live stream for three busy days of high-level keynotes, tech talks, parallel forums and networking.
In the new paper Factory: Fast Contact for Robotic Assembly, a research team from NVIDIA and the University of Washington introduces Factory, a set of physics simulation methods and robot learning tools for simulating contact-rich interactions in assembly with high accuracy, efficiency, and robustness.
In the new paper UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes, a Google Brain research team proposes UViM, a unified approach that leverages language modelling and discrete representation learning to enable the modelling of a wide range of computer vision tasks without task-specific modifications.
In the new paper Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, a Google Brain research team presents Imagen, a text-to-image diffusion model that combines deep language understanding and photorealistic image generation capabilities to achieve a new state-of-the-art FID score of 7.27 on the COCO dataset.
In the new paper Tracing Knowledge in Language Models Back to the Training Data, a team from MIT CSAIL and Google Research proposes a benchmark for tracing language models’ assertions to the associated training data, aiming to establish a principled ground truth and mitigate high compute demands for large neural language model training.
In the new paper Large Language Models are Zero-Shot Reasoners, a research team from the University of Tokyo and Google Brain demonstrates that large language models (LLMs) can become good zero-shot reasoners through the addition of a simple prompt — “Let’s think step by step” — that elicits a step-by-step thinking process before each question is answered. Their Zero-shot-CoT model achieves huge performance gains compared to the zero-shot baseline.
In the new paper Automated Crossword Solving, researchers from UC Berkeley and Matthew Ginsberg LLC present the Berkeley Crossword Solver (BCS), an end-to-end state-of-the-art system for automatically solving challenging crossword puzzles that captured first place in the American Crossword Puzzle Tournament.
In the new paper Masked Autoencoders As Spatiotemporal Learners, a Meta AI research team extends masked autoencoders (MAE) to spatiotemporal representation learning for video. The novel approach introduces negligible inductive biases on space-time while achieving strong empirical results compared to vision transformers (ViTs) and outperforms supervised pretraining by large margins.
In the new paper Meta-Learning Sparse Compression Networks, a DeepMind research team proposes steps for scaling implicit neural representations (INRs). The resulting meta-learning sparse compression networks can represent diverse data modalities such as images, manifolds, signed distance functions, 3D shapes, and scenes, achieving state-of-the-art results on some of them.
In the new paper Rethinking Reinforcement Learning Based Logic Synthesis, a research team from Huawei Noah’s Ark Lab develops a novel reinforcement learning-based logic synthesis method to automatically recognize critical operators and produce common operator sequences that are generalizable to unseen circuits.
In the new paper Standing on the Shoulders of Giant Frozen Language Models, AI21 Labs researchers propose three novel methods for learning small neural modules that specialize a frozen language model to different tasks. Their compute-saving approach outperforms conventional frozen model methods and challenges fine-tuning performance without sacrificing model versatility.
In the new paper Quantum Self-Attention Neural Networks for Text Classification, a team from Baidu Research and the University of Technology Sydney proposes the quantum self-attention neural network (QSANN), a simple yet powerful architecture that is effective and scalable to large real-world datasets.
In the new paper Unifying Language Learning Paradigms, a Google Research/Brain team proposes a framework for pretraining universal language models that are effective across many different tasks. Their 20B parameter model surpasses 175B GPT-3 on the zero-shot SuperGLUE benchmark and triples the performance of T5-XXL on one-shot summarization tasks.
In the new paper i-Code: An Integrative and Composable Multimodal Learning Framework, a Microsoft Azure Cognitive Services Research team presents i-Code, a self-supervised pretraining framework that enables the flexible integration of vision, speech, and language modalities and learns their vector representations in a unified manner.
A research team from Rikkyo University and AnyTech Co., Ltd. examines the suitability of different inductive biases for computer vision and proposes Sequencer, an architectural alternative to ViTs that leverages long short-term memory (LSTM) rather than self-attention layers to achieve ViT-competitive performance on long sequence modelling.
In the new paper A Probabilistic Interpretation of Transformers, ML Collective researcher Alexander Shim provides a probabilistic explanation of transformers’ exponential dot product attention and contrastive learning based on distributions of the exponential family.
In the new technical report OPT: Open Pre-trained Transformer Language Models, Meta AI open-sources OPT, a suite of decoder-only pretrained transformers ranging from 125M to 175B parameters. The release will enable more researchers to work with large-scale language models to drive the field forward.
In the new paper CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers, Tsinghua University and the Beijing Academy of Artificial Intelligence researchers pretrain a Cross-Modal general Language Model (CogLM) for text and image token prediction and finetune it for fast super-resolution. The resulting CogView2 hierarchical text-to-image system achieves significant speedups while generating images with better quality at comparable resolutions.
In the new paper Flamingo: a Visual Language Model for Few-Shot Learning, a DeepMind research team presents Flamingo, a novel family of visual language models (VLMs) that can handle multimodal tasks such as captioning, visual dialogue, classification and visual question answering when given only a few input/output samples.
Waymo and Google researchers’ new paper A Polynomial Expansion Perspective of Classification Loss Functions presents PolyLoss, a novel and simple framework that redesigns loss functions as a linear combination of polynomial functions that can be tailored to different target tasks and datasets.
In the new paper Expanding the Latent Space of StyleGAN for Real Face Editing, a research team from Northeastern University and Microsoft presents a novel two-branch method that expands the latent space of StyleGAN to enable identity-preserving and disentangled-attribute editing for real face images. The proposed approach achieves both qualitative and quantitative improvements over state-of-the-art methods.
A research team from BIGO Technology and iQIYI Inc. presents ClothFormer, a novel video virtual try-on framework that preserves clothes’ and humans’ features and details to generate realistic and temporally smooth try-on videos that surpass the outputs of current state-of-the-art virtual try-on systems by a large margin.
In the new paper Unified Pretraining Framework for Document Understanding, an Adobe Research and Adobe Document Cloud team presents a unified pretraining framework for document understanding that enables cross-modal connections, relevant information highlighting in both visual and textual modalities, and cross-modal connections. UDoc achieves impressive performance on various downstream tasks.
A research team from the University of Tokyo addresses the challenge of deepfake detection in their new paper Detecting Deepfakes with Self-Blended Images, proposing self-blended images (SBIs), a novel synthetic training data approach that outperforms state-of-the-art methods on unseen manipulations and scenes for deepfake detection tasks.
In the new paper PP-Matting: High-Accuracy Natural Image Matting, a Baidu research team proposes PP-Matting, a trimap-free architecture that combines a high-resolution detail branch and a semantic context branch to achieve state-of-the-art performance on natural image matting tasks.
A research team from DeepMind, Mila – University of Montreal and Google Brain proposes a neural network architecture that learns the graph structure of observational and/or interventional data via supervised training on synthetic graphs, making causal induction a black-box problem that generalizes well to new synthetic and naturalistic graphs.
In the new paper Dancing Under the Stars: Video Denoising in Starlight, a research team from UC Berkeley and Intel Labs leverages a GAN-tuned, physics-based noise model to represent camera noise under low light conditions and trains a novel denoiser that, for the first time, achieves photorealistic video denoising in starlight.
In the new paper A Modern Self-Referential Weight Matrix That Learns to Modify Itself, a research team from The Swiss AI Lab, IDSIA, University of Lugano (USI) & SUPSI, and King Abdullah University of Science and Technology (KAUST) presents a scalable self-referential weight matrix (SRWM) that leverages outer products and the delta update rule to update and improve itself.
In the new paper DeepDPM: Deep Clustering With an Unknown Number of Clusters, a research team from the Ben-Gurion University of the Negev presents DeepDPM, an effective deep nonparametric approach that removes the need to predefine the number of clusters in clustering tasks and can infer it instead.
In the new paper Solving ImageNet: a Unified Scheme for Training any Backbone to Top Results, a research team from Alibaba Group’s DAMO Academy introduces USI (Unified Scheme for ImageNet), a unified scheme for training any backbone on ImageNet that does not require adjustments or hyperparameter tuning between different models, and consistently yields top model results in terms of accuracy and efficiency.
In the new paper Hierarchical Text-Conditional Image Generation with CLIP Latents, an OpenAI research team combines the advantages of contrastive and diffusion models for text-conditional image generation tasks. Their proposed unCLIP model improves image diversity with minimal loss in photorealism and caption similarity, and produces image quality comparable to the state-of-the-art text-to-image system GLIDE.