‘Transfusion’ Is Supercharging the Training of Multi-Modal LLMs Like Never Before
A deep dive into how a novel approach called ‘Transfusion’ enables training a single Transformer model on both text and images, resulting in superior performance compared to traditional multi-modal approaches
Multimodal LLMs are gaining popularity.
Give them text, images, audio or video, and they will work with it all.
Conventional multimodal LLMs consist of separate, modality-specific architectures trained differently for different data types.
The next token prediction objective is typically used for discrete data such as text and code, while Diffusion is used for continuous data such as images, audio, and video.
How about combining them both to train a single model?
Researchers at Meta, Waymo, and the University of Southern California, in a recent arXiv preprint, have made this possible.
They introduce a technique called Transfusion that combines the next token prediction objective with diffusion to train a single Transformer over mixed-modality data.
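Concretely, the combined objective is just the sum of the two losses, each computed on its modality's portion of the input and balanced by a coefficient $\lambda$:

$$\mathcal{L}_{\text{Transfusion}} = \mathcal{L}_{\text{LM}} + \lambda \cdot \mathcal{L}_{\text{DDPM}}$$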
They show that scaling this method to 7 billion parameters results in a model that can generate text and images on par with similar-scale diffusion and language models.
Here’s a story where we take a deep dive into how this novel approach works and how it outperforms previous attempts at the same goal.
Let’s begin!
But First, What Is Next Token Prediction?
Next token prediction is an approach in which an LLM tries to guess which token (a word or a part of a word) will come next in a sequence of tokens.
To achieve this, it considers the words in its training data and learns to predict the next word.
More technically, given a sequence of words/tokens from a fixed vocabulary, the LLM predicts the probability of the next token based on the preceding tokens in the sequence.
It is trained using an Autoregressive classification approach, learning a probability distribution over the next token that is optimized by minimizing the Cross-entropy between the predicted and actual data distributions.
This loss (conventionally called the LM or Language Modeling loss) is represented as:

$$\mathcal{L}_{\text{LM}} = \mathbb{E}_{y} \left[ -\sum_{i} \log P_{\theta}\left(y_{i} \mid y_{<i}\right) \right]$$

where $y_i$ is the $i$-th token in the sequence and $y_{<i}$ are all the tokens that come before it.
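To make this concrete, here is a minimal PyTorch sketch of computing this loss on a toy batch (all shapes and tensors below are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: a batch of 2 sequences, 5 tokens each, vocab of 100
batch_size, seq_len, vocab_size = 2, 5, 100

logits = torch.randn(batch_size, seq_len, vocab_size)         # model outputs
tokens = torch.randint(0, vocab_size, (batch_size, seq_len))  # actual tokens

# Shift by one position so each position predicts the *next* token
pred_logits = logits[:, :-1, :]  # predictions at positions 0..n-2
target_ids = tokens[:, 1:]       # actual tokens at positions 1..n-1

# Cross-entropy between the predicted distribution and the actual next tokens
lm_loss = F.cross_entropy(
    pred_logits.reshape(-1, vocab_size),
    target_ids.reshape(-1),
)
```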
After training, the model samples from its learned probability distribution to generate the next token, guided by the Temperature and Top-p truncation settings.
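Here is a minimal sketch of such a sampling step, using standard temperature scaling and Top-p (nucleus) truncation (the function and its default values are illustrative):

```python
import torch

def sample_next_token(logits, temperature=0.8, top_p=0.9):
    """Sample one token id from 1-D `logits` using temperature and Top-p."""
    # Temperature scaling: <1 sharpens the distribution, >1 flattens it
    probs = torch.softmax(logits / temperature, dim=-1)

    # Top-p truncation: keep the smallest set of most-likely tokens
    # whose cumulative probability exceeds top_p
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = cumulative > top_p
    cutoff[1:] = cutoff[:-1].clone()  # shift so the crossing token is kept
    cutoff[0] = False                 # always keep the top token
    sorted_probs[cutoff] = 0.0
    sorted_probs /= sorted_probs.sum()  # renormalize the truncated distribution

    # Sample from what remains
    idx = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_ids[idx].item()

# Usage with dummy logits over a 100-token vocabulary
next_id = sample_next_token(torch.randn(100))
```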
Next, What Is Diffusion?
The Diffusion process in Machine Learning is based on the physical process of diffusion, in which particles move from areas of higher concentration to areas of lower concentration, eventually dispersing evenly in an equilibrium state.
Denoising Diffusion Probabilistic Models (DDPMs) use this concept for training.
Unlike language models that work with discrete data, such as text tokens, DDPMs operate over continuous data vectors, like images.
In Diffusion, continuous training data (e.g., images) is gradually corrupted by adding Gaussian noise step by step (Forward process).
A pre-defined noise schedule, such as the Cosine scheduler, parameterizes this process (via $\beta_t$) and determines how much noise is added at each step.
The model is then trained to reverse this process by learning how to remove this noise and revert to the original data (Reverse process).
The model’s overall training objective is to minimize the difference between the predicted and the actual noise added at each time step, using a loss function such as the Mean Squared error.
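Putting the Forward process and this objective together, here is a minimal PyTorch sketch of a DDPM training step (a simple linear $\beta$ schedule and a stand-in `denoiser` model are assumed for brevity; the Cosine scheduler would only change the betas):

```python
import torch
import torch.nn.functional as F

T = 1000  # number of diffusion steps

betas = torch.linspace(1e-4, 0.02, T)       # noise added per step
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative product, "alpha-bar"

def training_loss(denoiser, x0):
    """One DDPM training step: noise the clean data x0, predict that noise.

    `denoiser` is any assumed model mapping (noisy_x, t) -> predicted noise.
    """
    t = torch.randint(0, T, (x0.shape[0],))  # random timestep per sample
    noise = torch.randn_like(x0)             # the Gaussian noise to add
    a_bar = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))
    # Forward process in closed form: x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*noise
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    # Mean Squared error between predicted and actual noise
    return F.mse_loss(denoiser(x_t, t), noise)
```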
Once trained, generating new data involves beginning with random noise and iteratively denoising it to create high-quality, meaningful samples.
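And a matching sketch of the Reverse process at generation time, reusing the same schedule and assumed `denoiser`:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # same schedule as above
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def generate(denoiser, shape):
    """Start from pure noise and iteratively denoise it, step by step."""
    x = torch.randn(shape)  # x_T: pure Gaussian noise
    for t in reversed(range(T)):
        eps = denoiser(x, torch.full((shape[0],), t))
        alpha, a_bar = alphas[t], alpha_bars[t]
        # Remove the predicted noise contribution (DDPM posterior mean)
        x = (x - (1 - alpha) / (1 - a_bar).sqrt() * eps) / alpha.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # sampling noise
    return x
```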
A technique called Classifier-Free Diffusion Guidance conditions the model’s image generation on a given caption.
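The idea is that the model is trained with the caption randomly dropped, so at sampling time it can produce both a caption-conditioned and an unconditional noise prediction and extrapolate between them. A minimal sketch, where `denoiser`, `caption_emb`, and `null_emb` are assumed stand-ins:

```python
import torch

def cfg_noise_prediction(denoiser, x_t, t, caption_emb, null_emb, scale=5.0):
    """Classifier-Free Guidance: blend conditional and unconditional
    noise predictions at each denoising step."""
    eps_cond = denoiser(x_t, t, caption_emb)  # caption-conditioned prediction
    eps_uncond = denoiser(x_t, t, null_emb)   # unconditional prediction
    # Push the result further in the direction the caption suggests;
    # larger `scale` means stronger adherence to the caption
    return eps_uncond + scale * (eps_cond - eps_uncond)
```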
And Then There Are VAEs & LDMs
Early Diffusion models worked directly with image pixels, making their approach computationally intensive.
Modern Diffusion techniques instead use Variational Autoencoders (VAEs) to create a compact latent representation of the original image data, where small image patches are converted into vectors.
Latent Diffusion Models (LDMs) then work efficiently on this latent representation, making tasks like image generation much less computationally expensive.
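A minimal sketch of this idea, assuming a frozen `vae_encoder` and a DDPM-style `diffusion_loss` like the one sketched earlier:

```python
import torch

def ldm_training_step(vae_encoder, denoiser, diffusion_loss, images):
    """Latent Diffusion: run the expensive diffusion objective in the
    compact latent space produced by a frozen VAE encoder.

    All arguments are assumed stand-ins. For a 256x256x3 image, a typical
    VAE yields a 32x32x4 latent, so the diffusion model processes roughly
    48x fewer values per image.
    """
    with torch.no_grad():
        z0 = vae_encoder(images)           # compress pixels -> latents
    return diffusion_loss(denoiser, z0)    # diffuse in latent space

# At generation time, the denoising loop runs entirely in latent space;
# the VAE decoder is applied once at the end to map the result back to pixels.
```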