vision transformer

by Synced 2024-10-07

Instant 3D Vision: Apple’s Depth Pro Delivers High-Precision Depth Maps in 0.3 Seconds

Apple introduces Depth Pro, a state-of-the-art foundation model designed for zero-shot metric monocular depth estimation. This model can generate high-resolution depth maps with exceptional clarity and fine detail, producing a 2.25-megapixel depth map in just 0.3 seconds on a standard GPU.

by Synced 2024-08-27

AI, Machine Learning & Data Science, Research

Meta’s Sapiens: Revolutionizing Human Pose, Segmentation, and Depth Estimation with Vision Transformers

In a new paper Sapiens: Foundation for Human Vision Models, a Meta research team introduces Sapiens, a suite of models designed to address four core human-centric vision tasks: 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction.

by Synced 2023-10-26

AI, Machine Learning & Data Science, Research

DeepMind Verifies ConvNets Can Match Vision Transformers at Scale

In a new paper ConvNets Match Vision Transformers at Scale, a Google DeepMind research team challenges the prevailing belief that Vision Transformers possess superior scaling capabilities compared to ConvNets and provides empirical results revealing that ConvNets can indeed hold their own against Vision Transformers at scale.

by Synced 2023-07-17

AI, Computer Vision & Graphics, Machine Learning & Data Science, Research

DeepMind Proposes Novel Vision Transformer for Arbitrary Size & Resolution

In a new paper Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution, a Google DeepMind research team further improves ViT with Native Resolution ViT (NaViT), which is able to process input sequences of arbitrary resolutions and aspect ratios.
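As a rough illustration of the "pack" idea in the paper's title, the sketch below counts the patch tokens of images with different resolutions and aspect ratios and groups them into fixed-length sequences. The patch size of 16, the maximum sequence length of 256, the first-fit strategy, and the helper names `num_patches` and `greedy_pack` are all assumptions made for this example, not details taken from the teaser above.

```python
from typing import List, Tuple


def num_patches(height: int, width: int, patch: int = 16) -> int:
    """Token count for one image: whole patches along each side."""
    return (height // patch) * (width // patch)


def greedy_pack(lengths: List[int], max_len: int) -> List[List[int]]:
    """First-fit packing: place each sequence in the first pack with room."""
    packs: List[List[int]] = []
    free: List[int] = []  # remaining capacity of each pack
    for n in lengths:
        if n > max_len:
            raise ValueError(f"sequence of {n} tokens exceeds max_len={max_len}")
        for i, room in enumerate(free):
            if n <= room:
                packs[i].append(n)
                free[i] -= n
                break
        else:
            # No existing pack has room, so open a new one.
            packs.append([n])
            free.append(max_len - n)
    return packs


# Hypothetical (height, width) pairs with mixed aspect ratios.
images: List[Tuple[int, int]] = [(224, 224), (128, 384), (64, 64), (320, 160)]
lengths = [num_patches(h, w) for h, w in images]
packs = greedy_pack(lengths, max_len=256)
print(lengths)
print(packs)
```

Each inner list is one packed sequence. A real model would also need masking so that tokens from different images in the same pack do not attend to each other.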

by Synced 2023-06-19

AI, Computer Vision & Graphics, Machine Learning & Data Science, Research

DeepMind Claims Image Captioner Alone Is Surprisingly More Powerful Than Previously Believed, Competing with CLIP

In a new paper Image Captioners Are Scalable Vision Learners Too, a DeepMind research team presents CapPa, an image captioning-based pretraining strategy that can compete with CLIP and exhibits favorable model and data scaling properties, verifying that plain image captioning can be a competitive pretraining strategy for vision backbones.

by Synced 2023-04-18

AI, Computer Vision & Graphics, Machine Learning & Data Science, Research

Microsoft & Bath U’s SpectFormer Significantly Improves Vision Transformers via Frequency and Attention

In the new paper SpectFormer: Frequency and Attention Is What You Need in a Vision Transformer, a research team from Microsoft and the University of Bath proposes SpectFormer, a novel transformer architecture that combines spectral and multi-headed attention layers to better capture appropriate feature representations and improve performance.

by Synced 2022-12-21

AI, Computer Vision & Graphics, Machine Learning & Data Science, Research

Meet Google’s FlexiViT: A Flexible Vision Transformer for All Patch Sizes

In the new paper FlexiViT: One Model for All Patch Sizes, a Google Research team presents FlexiViT, a flexible ViT that performs well across a wide range of patch sizes, matching or outperforming standard fixed-patch ViT performance at no extra cost.
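To see why patch size matters, recall that a ViT splits a square image into non-overlapping patches, so the token sequence length is (image size / patch size) squared. The short sketch below tabulates this for several patch sizes. The 240-pixel image size and the list of patch sizes are assumptions chosen for this example because they divide evenly, and `sequence_length` is a hypothetical helper.

```python
def sequence_length(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image and square patches."""
    if image_size % patch_size != 0:
        raise ValueError("patch_size must divide image_size evenly")
    per_side = image_size // patch_size
    return per_side * per_side


IMAGE_SIZE = 240  # assumed resolution, divisible by every patch size below
PATCH_SIZES = [8, 10, 12, 15, 16, 20, 24, 30, 40, 48]

token_counts = {p: sequence_length(IMAGE_SIZE, p) for p in PATCH_SIZES}
for patch, tokens in token_counts.items():
    print(f"patch {patch:>2} px -> {tokens:>3} tokens")
```

Smaller patches give longer sequences and therefore more compute, so a single model that works across patch sizes can trade accuracy for speed at inference time.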

by Synced 2022-08-17

AI, Machine Learning & Data Science, Research

‘A Promising Direction for Semi-Supervised Learning’ – AWS Lab’s Semi-ViT Beats CNNs While Maintaining Scalability

In the new paper Semi-supervised Vision Transformers at Scale, a research team from AWS AI Labs proposes a semi-supervised learning pipeline for vision transformers that is stable, reduces hyperparameter tuning sensitivity, and outperforms conventional convolutional neural networks.

by Synced 2022-06-29

AI, Computer Vision & Graphics, Machine Learning & Data Science, Research

NVIDIA’s Global Context ViT Achieves SOTA Performance on CV Tasks Without Expensive Computation

In the new paper Global Context Vision Transformers, an NVIDIA research team proposes the Global Context Vision Transformer, a novel yet simple hierarchical ViT architecture comprising global self-attention and token generation modules that enables the efficient modelling of both short- and long-range dependencies without costly compute operations while achieving SOTA results across various computer vision tasks.

by Synced 2022-06-06

AI, Computer Vision & Graphics, Machine Learning & Data Science, Research

Snap & NEU’s EfficientFormer Models Push ViTs to MobileNet Speeds While Maintaining High Performance

In the new paper EfficientFormer: Vision Transformers at MobileNet Speed, a research team from Snap Inc. and Northeastern University proposes EfficientFormer, a vision transformer that runs as fast as MobileNet while maintaining high performance.

by Synced 2022-04-07

AI, Machine Learning & Data Science, Research

Kaiming He’s Meta AI Team Proposes ViTDet: A Plain Vision Transformer Backbone Competitive With Hierarchical Backbones on Object Detection

A Meta AI research team explores the plain, non-hierarchical vision transformer (ViT) as a backbone network for object detection, proposing a ViT Detector that achieves performance competitive with traditional hierarchical backbones.

by Synced 2022-03-25

AI, Machine Learning & Data Science, Research

Microsoft’s FocalNets Replace ViTs’ Self-Attention With Focal Modulation to Improve Visual Modelling

A Microsoft Research team proposes FocalNet (Focal Modulation Network), a simple and attention-free architecture designed to replace transformers’ self-attention module. FocalNets exhibit significant superiority over self-attention for effective and efficient visual modelling in real-world applications.

by Synced 2022-01-13

AI, Computer Vision & Graphics, Machine Learning & Data Science, Research

Facebook AI & UC Berkeley’s ConvNeXts Compete Favourably With SOTA Hierarchical ViTs on CV Benchmarks

A team from Facebook AI Research and UC Berkeley proposes ConvNeXts, a pure ConvNet model that achieves performance comparable with state-of-the-art hierarchical vision transformers on computer vision benchmarks while retaining the simplicity and efficiency of standard ConvNets.

by Synced 2021-12-16

AI, Computer Vision & Graphics, Machine Learning & Data Science, Research

NVIDIA’s AdaViT Halts Token Computation to Adaptively Adjust ViT Inference Cost on Images of Different Complexity

NVIDIA researchers propose AdaViT, an input-dependent mechanism that adaptively adjusts vision transformers’ inference cost by halting the compute of different tokens at different depths to reserve compute for discriminative tokens.

by Synced 2021-11-22

AI, Computer Vision & Graphics, Machine Learning & Data Science, Research

Microsoft Asia’s Swin Transformer V2 Scales the Award-Winning ViT to 3 Billion Parameters and Achieves SOTA Performance on Vision Benchmarks

Microsoft Research Asia has upgraded its Swin Transformer with a new version that scales to three billion parameters, can be trained on images with resolutions up to 1,536 x 1,536, and advances the SOTA on four representative vision benchmarks.

by Synced 2021-11-17

AI, Machine Learning & Data Science, Research

Is BERT the Future of Image Pretraining? ByteDance Team’s BERT-like Pretrained Vision Transformer iBOT Achieves New SOTAs

A research team from ByteDance, Johns Hopkins University, Shanghai Jiao Tong University and UC Santa Cruz seeks to apply the proven technique of masked language modelling to the training of better vision transformers, presenting iBOT (image BERT pretraining with Online Tokenizer), a self-supervised framework that performs masked prediction with an online tokenizer.

by Synced 2021-11-09

AI, Machine Learning & Data Science, Research

Can ViT Layers Express Convolutions? Peking U, UCLA & Microsoft Researchers Say ‘Yes’

In the new paper Can Vision Transformers Perform Convolution?, a research team from Peking University, UCLA and Microsoft Research constructively proves that a single ViT layer with image patches as the input can perform any convolution operation, and shows that ViT performance in low data regimes can be significantly improved using their proposed ViT training pipeline.

by Synced 2021-08-27

AI, Computer Vision & Graphics, Machine Learning & Data Science, Research

Google Brain Uncovers Representation Structure Differences Between CNNs and Vision Transformers

A Google Brain research team explores the internal representation structures of ViTs and CNNs on image classification tasks, providing insights on key differences between the two approaches.