DeepMind’s JetFormer: Unified Multimodal Models Without Modelling Constraints

Recent advancements in training large multimodal models have been driven by efforts to eliminate modeling constraints and unify architectures across domains. Despite these strides, many existing models still rely on separately trained components such as modality-specific encoders and decoders.

In a new paper, JetFormer: An Autoregressive Generative Model of Raw Images and Text, a Google DeepMind research team introduces JetFormer, an autoregressive, decoder-only Transformer designed to model raw data directly. The model is trained to maximize the likelihood of raw data without depending on any pre-trained components, and it can both understand and generate text and images.

The team summarizes the key innovations in JetFormer as follows:

  1. Leveraging Normalizing Flows for Image Representation: The pivotal insight behind JetFormer is its use of a powerful normalizing flow (termed a “jet”) to encode images into a latent representation suitable for autoregressive modeling. Autoregressive modeling of raw image patches at the pixel level has been impractical because of the complexity of their structure. JetFormer’s flow model addresses this by providing a lossless, invertible representation that is trained jointly with the multimodal model. At inference, the flow’s invertibility means that decoding an image amounts to running the flow in reverse.
  2. Guiding the Model to High-Level Information: To focus the model on essential high-level information, the researchers employ two strategies:
  • Progressive Gaussian Noise Augmentation: During training, Gaussian noise is added to the images and its level is gradually reduced, encouraging the model to prioritize overarching features early in the learning process.
  • Managing Redundancy in Image Data: JetFormer allows redundant dimensions of natural images to be selectively excluded from the autoregressive model. Alternatively, Principal Component Analysis (PCA) is explored to reduce dimensionality without sacrificing critical information.
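To make the two ideas in the list above more concrete, here is a minimal NumPy sketch, not the paper's implementation. It shows a single affine coupling layer, a common building block of normalizing flows, which is exactly invertible and has a cheap log-determinant, followed by a linearly decaying noise schedule. The names coupling_forward, coupling_inverse and noise_std, the toy sizes, the random untrained weights, and the linear schedule are all assumptions made for illustration; the article does not specify the architecture or the schedule JetFormer actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 4 image patches, each flattened to 8 values.
NUM_PATCHES, DIM = 4, 8
HALF = DIM // 2

# Random, untrained weights standing in for a small conditioner network.
W_scale = rng.normal(scale=0.1, size=(HALF, HALF))
W_shift = rng.normal(scale=0.1, size=(HALF, HALF))


def coupling_forward(x):
    """Affine coupling: rescale and shift the second half of each patch,
    conditioned on the (unchanged) first half."""
    x_a, x_b = x[:, :HALF], x[:, HALF:]
    log_scale = np.tanh(x_a @ W_scale)  # bounded for numerical stability
    shift = x_a @ W_shift
    z_b = x_b * np.exp(log_scale) + shift
    z = np.concatenate([x_a, z_b], axis=1)
    # The Jacobian is triangular, so log|det| is the sum of the log-scales.
    log_det = log_scale.sum(axis=1)
    return z, log_det


def coupling_inverse(z):
    """Exact inverse of coupling_forward; this is what makes decoding possible."""
    z_a, z_b = z[:, :HALF], z[:, HALF:]
    log_scale = np.tanh(z_a @ W_scale)
    shift = z_a @ W_shift
    x_b = (z_b - shift) * np.exp(-log_scale)
    return np.concatenate([z_a, x_b], axis=1)


def noise_std(step, total_steps, start_std=0.5, end_std=0.0):
    """Noise level at a training step (assumed linear decay, illustrative only)."""
    frac = min(step / total_steps, 1.0)
    return start_std + frac * (end_std - start_std)


patches = rng.normal(size=(NUM_PATCHES, DIM))

# Early in training the input is noisier; later the noise fades out.
noisy_patches = patches + noise_std(step=10, total_steps=100) * rng.normal(size=patches.shape)

latents, log_det = coupling_forward(patches)
recovered = coupling_inverse(latents)
print("round trip is lossless:", np.allclose(recovered, patches))
```

In a full model, many such layers would be stacked, and the log-determinant would be added to the log-likelihood that the autoregressive Transformer assigns to the latents, so that the flow and the Transformer can be trained together on the likelihood of the raw pixels.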

The team evaluated JetFormer on two challenging tasks: ImageNet class-conditional image generation and web-scale multimodal generation. The results show that JetFormer is competitive with less flexible models when trained on large-scale data, and that it handles both image and text generation. Because the model is trained end to end, it does not need the separately trained encoders and decoders that many existing models rely on.

JetFormer simplifies multimodal architectures by unifying the modeling approach for text and images. Its use of normalizing flows and its emphasis on prioritizing high-level features point toward fully end-to-end generative modeling of raw data. The research lays the groundwork for further exploration of unified multimodal systems and for more integrated approaches to AI model development.

The paper JetFormer: An Autoregressive Generative Model of Raw Images and Text is on arXiv.


Author: Hecate He | Editor: Chain Zhang


