Meta’s Large Concept Models (LCMs) Are Here To Challenge And Redefine LLMs
A deep dive into the ‘Large Concept Model’, a novel language processing architecture, and an evaluation of its performance against state-of-the-art LLMs
LLMs, or Large Language Models, are dominating all language-based tasks today.
Alongside language tasks, their extension to images, video, and speech has led to state-of-the-art performance as well.
Most LLMs are based on the Decoder-only Transformer architecture.
This architecture is trained to predict the next token based on the context of preceding tokens (the Next-word prediction objective).
Consider the architecture of GPT as shown below.
Despite their remarkable performance, this architecture is far from how human cognition works.
Instead of thinking in words, humans understand, reason about and generate ideas at multiple levels of abstraction.
Think about a teacher.
They do not plan every spoken word during their lecture.
Instead, in their mind, they outline a flow of high-level ideas they want to discuss.
These ideas can be distinct or depend on each other. They can also be in different languages or modalities (sound, image, text, etc.).
In each case, the teacher never considers every single word associated with them.
Instead, these high-level abstract ideas are translated by adding details at the lower levels of abstraction using different words each time they are discussed.
Such hierarchical high-level abstraction is not explicitly present in the current LLM architecture. (Although they might implicitly learn such representations during their training.)
Building on this insight, Meta researchers published a new architecture called the Large Concept Model or LCM.
Instead of reasoning and processing at the token level, LCMs do these in an abstract embedding space.
This embedding space is designed to be independent of language or modality.
Compare this to the current LLMs that are English-centric and token-based.
LCMs process, reason in and generate abstract atomic ideas called ‘Concepts’, which could correspond to a sentence in a text document or an equivalent speech utterance in an audio clip. (A Concept could even be smaller or larger than a single sentence or utterance.)
These generated Concepts can be decoded into any language and modality (that is supported by the embedding) in a purely zero-shot fashion without re-training the model.
Interestingly, they outperform existing LLMs of similar size at this zero-shot multilingual generalization and compare well with them on other language-based tasks.
In this story, we deep dive into how Large Concept Models work and see how they stack up against state-of-the-art LLMs.
Let’s begin!
An Overview Of How An LCM Works
A Large Concept Model or LCM takes an input of ‘Concepts’.
For a text dataset, the text is first segmented into sentences (typically 10 to 20 tokens long). These sentences represent ‘Concepts’ for the model.
(Note that a ‘Concept’ can even be bigger than a sentence.)
These sentences are converted into embeddings using SONAR. (We will come back to this soon.)
The sequence of sentence embeddings is then fed to the LCM, which processes them to output a new sequence of embeddings.
These embeddings are again decoded back to sentences (Concepts) using SONAR.
The encoder and decoder used in the process are fixed and not trained with the model.
It is interesting to note that the output of an LCM can be decoded into any other language or modality, irrespective of the input language or modality, without any re-processing or retraining.
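To make this flow concrete, here is a minimal sketch of the pipeline in Python. All component names are illustrative stand-ins for the pieces described above (a sentence segmenter, the frozen SONAR encoder/decoder, and the LCM itself), not an actual API.

```python
from typing import Callable, List, Sequence

def lcm_pipeline(
    document: str,
    segment: Callable[[str], List[str]],                 # e.g. an SaT sentence splitter
    sonar_encode: Callable[[List[str]], Sequence],       # sentences -> concept embeddings
    lcm_generate: Callable[[Sequence], Sequence],        # concept embeddings -> new concepts
    sonar_decode: Callable[[Sequence, str], List[str]],  # concept embeddings -> text
    target_lang: str = "eng_Latn",
) -> List[str]:
    """End-to-end LCM flow: segment -> encode -> reason over Concepts -> decode.

    The encoder/decoder are passed in as frozen callables because SONAR is
    fixed and not trained together with the LCM.
    """
    sentences = segment(document)                    # 1. text -> sentences ("Concepts")
    concepts = sonar_encode(sentences)               # 2. sentences -> SONAR embeddings
    new_concepts = lcm_generate(concepts)            # 3. LCM predicts new concept embeddings
    return sonar_decode(new_concepts, target_lang)   # 4. decode into any supported language
```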
Let’s understand this process step by step.
How Is Text Broken Down Into Sentences?
The process starts with segmenting large corpora of text into sentences.
Researchers use the Segment any Text (SaT) technique for this.
Segmentation is capped at 200 characters, as they found that this number produces the best results in their evaluations.
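As a rough illustration of this capping step (a simplified stand-in, not the authors' exact implementation), the sketch below takes sentences produced by any segmenter and further splits anything longer than 200 characters:

```python
def cap_segments(sentences: list[str], max_chars: int = 200) -> list[str]:
    """Ensure no segment exceeds `max_chars` characters.

    The real 'capped' segmentation may split on punctuation or other natural
    boundaries; fixed-size slicing is used here only to keep the sketch short.
    """
    capped: list[str] = []
    for sentence in sentences:
        if len(sentence) <= max_chars:
            capped.append(sentence)
        else:
            capped.extend(
                sentence[i : i + max_chars]
                for i in range(0, len(sentence), max_chars)
            )
    return capped
```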
How Are The Concept Embeddings Generated?
Sentences are then converted into embeddings using SONAR or Sentence-Level Multimodal and Language-Agnostic Representations.
SONAR embeddings are trained with an NLLB-1B Encoder/Decoder architecture using the Mean Squared Error (MSE) loss.
After training the text embeddings, they are extended to the speech modality (initialised using the W2v-Bert 2.0 Encoder) through a teacher-student Knowledge Distillation approach, again using the MSE loss.
SONAR impressively supports 200 languages for text input and output, 76 languages for speech input, and speech output in English.
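For reference, encoding sentences into SONAR embeddings looks roughly like the snippet below. It follows the usage shown in the facebookresearch/SONAR repository at the time of writing; exact module paths and model card names may change between releases.

```python
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline

# Load the frozen SONAR text encoder (model card names as documented in the repo).
t2vec = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder",
    tokenizer="text_sonar_basic_encoder",
)

sentences = [
    "Large Concept Models reason over sentences rather than tokens.",
    "Each sentence becomes a single fixed-size embedding.",
]

# Each sentence is mapped to one fixed-size SONAR embedding vector.
embeddings = t2vec.predict(sentences, source_lang="eng_Latn")
print(embeddings.shape)  # e.g. torch.Size([2, 1024])
```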
The LCM Architecture
The Large Concept Model is essentially a decoder-only Transformer that is enhanced using two additional components that handle Concept embeddings.
These are called the PreNet and the PostNet.
The PreNet normalizes the SONAR embeddings and maps them into the model’s hidden dimension.
The PostNet reverses the normalization and maps the model’s outputs back into the SONAR embedding space.
The researchers termed this structure the Base-LCM (as there are a few more LCM variants).
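A minimal PyTorch sketch of this structure is shown below. It is an illustration of the idea only: the real Base-LCM uses its own normalization scheme and hyperparameters, and the dimensions and layer counts here are made up for the example.

```python
import torch
import torch.nn as nn

class BaseLCM(nn.Module):
    def __init__(self, sonar_dim: int = 1024, d_model: int = 2048,
                 n_layers: int = 8, n_heads: int = 16):
        super().__init__()
        # PreNet: normalize SONAR embeddings and map them into the model width.
        self.pre_net = nn.Sequential(nn.LayerNorm(sonar_dim),
                                     nn.Linear(sonar_dim, d_model))
        # Decoder-only Transformer backbone (causal self-attention over concepts).
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # PostNet: map hidden states back into the SONAR embedding space
        # (the real PostNet also reverses the PreNet's normalization).
        self.post_net = nn.Linear(d_model, sonar_dim)

    def forward(self, concepts: torch.Tensor) -> torch.Tensor:
        # concepts: (batch, seq_len, sonar_dim), one embedding per sentence.
        seq_len = concepts.size(1)
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=concepts.device),
            diagonal=1,
        )
        hidden = self.pre_net(concepts)
        hidden = self.backbone(hidden, mask=causal_mask)
        return self.post_net(hidden)  # predicted next-concept embeddings
```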
The LCM operates with the next-concept (next-sentence embeddings) prediction objective.
In other words, it predicts the next Concept based on the preceding ones (in an autoregressive manner).
This is done by minimizing the Mean Squared Error (MSE) between the predicted and the ground truth embedding.
For a data distribution of sequences of concepts denoted by q, the overall training loss is shown below:
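Written out in simplified notation consistent with the description above (the paper's exact formulation may differ in details), the objective is:

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{x \sim q}
    \left[ \sum_{n=1}^{|x|}
      \big\lVert f\!\left(x_{<n};\,\theta\right) - x_{n} \big\rVert^{2}
    \right]
```

Here x_n is the SONAR embedding of the n-th concept in a sequence x, x_{<n} is its preceding context, and f is the Base-LCM's prediction for the next concept.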
An ‘End of Text’ (eot) suffix is added to the end of the sequence of concepts, and this suffix is encoded using SONAR as well.
The model learns to predict this concept to indicate the end of the sequence.
At inference time, text generation stops based on two conditions:
1. The cosine similarity between the generated embedding and the eot embedding crosses a certain threshold (represented by s(eot)).
2. The cosine similarity between consecutive embeddings crosses a certain threshold (represented by s(prev)).
Both s(eot) and s(prev) are set to 0.9 in the original model implementation.
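A tiny sketch of these stopping checks (illustrative variable names, unbatched embeddings, not the reference implementation):

```python
import torch
import torch.nn.functional as F

S_EOT, S_PREV = 0.9, 0.9  # thresholds used in the original implementation

def should_stop(new_emb: torch.Tensor, prev_emb: torch.Tensor,
                eot_emb: torch.Tensor) -> bool:
    """Stop generating if the new concept looks like 'end of text'
    or is nearly identical to the previous concept."""
    sim_to_eot = F.cosine_similarity(new_emb, eot_emb, dim=-1).item()
    sim_to_prev = F.cosine_similarity(new_emb, prev_emb, dim=-1).item()
    return sim_to_eot > S_EOT or sim_to_prev > S_PREV
```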
Moving Forward To A Diffusion-Based LCM
Training the Base-LCM by minimizing MSE loss results in the model producing deterministic responses and predicting an average representation of all possible continuations.
This averaged embedding might not have any meaningful translation.
It might also not be creative enough for language tasks with multiple valid continuations.
To fix the poor results obtained by the Base-LCM, researchers incorporate Diffusion into the model so that it can probabilistically generate meaningful and diverse next-concept embeddings.
With Diffusion, an LCM can learn a conditional probability distribution over sentence embeddings and sample from it during inference time.
Similar to how diffusion generates images, in a Diffusion-based LCM, a forward noising process adds noise to sentence embeddings, and a reverse denoising process is used to predict the original noiseless embeddings.
Let’s learn this in more detail.
The Two Steps Of Diffusion
In the Forward noising process, sentence embeddings (x(0)) are gradually corrupted (noised) with Gaussian noise over a series of timesteps (t) to produce noisy embeddings (x(t)).
A Noise schedule determines how the added noise increases over time.
This process is shown using the equation below:
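In standard variance-preserving diffusion notation (the paper's exact parameterisation may differ), this is:

```latex
q\!\left(x^{(t)} \mid x^{(0)}\right)
  = \mathcal{N}\!\left(x^{(t)};\; \sqrt{\bar{\alpha}_{t}}\, x^{(0)},\; \left(1 - \bar{\alpha}_{t}\right) I \right)
```

The cumulative noise term (the alpha-bar in the equation) is set by the noise schedule and shrinks towards zero as t grows, so x(t) approaches pure Gaussian noise.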
In the Reverse denoising process, a Transformer-based model is trained to predict the original or noiseless embeddings (x(0)) from the noisy embeddings (x(t)).
The objective for training a Diffusion model is to minimize the reconstruction error of the clean embeddings.
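Concretely, with a denoiser g that predicts the clean embedding (simplified notation; the paper may add timestep-dependent weighting), the objective can be written as:

```latex
\mathcal{L}_{\text{diff}}
  = \mathbb{E}_{x^{(0)},\, t}\,
    \Big\lVert\, g\!\left(x^{(t)},\, t \mid \text{context}\right) - x^{(0)} \Big\rVert^{2}
```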
During inference time, the model starts with pure Gaussian noise and iteratively denoises it to produce embeddings.
There’s another idea that is important to learn before we discuss Diffusion-based LCMs. This is Classifier-Free Guidance.
Classifier-Free Guidance (CFG)
This is a technique that allows diffusion models to generate outputs based on a given conditioning input without requiring a separate classifier.
For example, generating an image based on a prompt.
(Before the invention of CFG, a separate classifier was trained with a diffusion model that guided the generation process aligned to a conditioning input. This technique was called Classifier-based guidance.)
Classifier-free guidance works by training a diffusion model to generate both conditioned outputs (p(x | y), where y is the conditioning variable) and unconditioned outputs (p(x)).
At inference time, the model combines these to balance generation quality and diversity using the following equation:
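A common form of this combination, written for a denoiser g that predicts the clean output (the paper's exact formula may differ slightly), is:

```latex
\tilde{g}\!\left(x^{(t)},\, t \mid y\right)
  = g\!\left(x^{(t)},\, t\right)
  + w \cdot \Big( g\!\left(x^{(t)},\, t \mid y\right) - g\!\left(x^{(t)},\, t\right) \Big)
```

Here w is the guidance scale: w = 1 recovers the purely conditional model, while larger values push the output to follow the conditioning y more strongly, trading diversity for adherence.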
Influenced by the processes described above, two LCM variants are created:
One-Tower Diffusion LCM
Two-Tower Diffusion LCM
Let’s discuss them further.
One-Tower Diffusion LCM
This LCM uses a single Transformer for both noising and denoising processes.
Since just one component is involved in processing, the model is called ‘One-Tower’.
It predicts the clean next sentence embedding given a noisy input conditioned on previous embeddings.
First, each embedding is concatenated with the corresponding diffusion timestep embedding to give the model information about the noise level.
Next, learned position embeddings are added to the input embeddings to encode the sequential structure.
This prepares the embeddings that are fed to the LCM.
The Transformer of the LCM uses Causal multi-head self-attention to process this sequence of embeddings.
During training, the input consists of interleaved noisy and clean embeddings, and the attention mask is specifically designed to allow the noisy embeddings to attend only to the clean embeddings.
This teaches the model to condition its denoising process properly.
To enable classifier-free guidance at inference time, this self-attention to the clean context embeddings is occasionally dropped with a certain probability during training.
This allows the model to learn both conditional and unconditional generation.
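The conditioning dropout behind this is conceptually simple. A hedged sketch (the probability value and the zeroing mechanism are illustrative; the paper drops attention to the context rather than literally zeroing it):

```python
import torch

P_UNCOND = 0.15  # illustrative dropout probability, not the paper's value

def maybe_drop_context(context: torch.Tensor) -> torch.Tensor:
    """With probability P_UNCOND, hide the clean context embeddings so the
    model also learns unconditional generation (needed for CFG at inference)."""
    if torch.rand(()) < P_UNCOND:
        return torch.zeros_like(context)
    return context
```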
Two-Tower Diffusion LCM
This architecture consists of two components or ‘Towers’ that handle processing context and denoising separately.
The first Tower is called Contextualizer.
This is a decoder-only Transformer with Causal Self-attention which is tasked to encode the preceding context embeddings.
The outputs of this tower are fed to the second tower called the Denoiser.
This tower consists of a stack of Transformer blocks with Cross-attention, which is tasked to iteratively denoise the noisy next embedding to predict the clean embedding.
Each layer of the denoiser uses the Adaptive Layer Normalization (AdaLN) mechanism that helps modulate it to handle varying levels of noise during the denoising process.
Also, the Self-attention layers in the denoiser only attend to the current position and not the preceding noised context.
During training, rows of the Cross-attention mask in the denoiser are randomly dropped with a certain probability. This enables classifier-free guidance during inference.
At inference time, the model starts with a random noise vector.
The denoiser iteratively denoises it, guided by the contextualizer's outputs through cross-attention and modulated by AdaLN at each step.
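Putting the pieces together, generating one next concept looks roughly like the sketch below. The `contextualizer` and `denoiser` arguments are stand-ins for the two towers, the update rule is a deliberately naive placeholder for a real diffusion sampler, and passing `context=None` stands in for the unconditional branch used by classifier-free guidance.

```python
import torch

@torch.no_grad()
def generate_next_concept(prev_concepts: torch.Tensor, contextualizer, denoiser,
                          num_steps: int = 40, guidance_scale: float = 1.5) -> torch.Tensor:
    # Encode the preceding clean concepts once with the first tower.
    context = contextualizer(prev_concepts)          # (batch, seq_len, dim)

    # Start from pure Gaussian noise in the SONAR embedding space.
    x_t = torch.randn_like(prev_concepts[:, -1, :])

    for t in reversed(range(num_steps)):
        # Classifier-free guidance: blend conditional and unconditional predictions.
        x0_cond = denoiser(x_t, t, context=context)
        x0_uncond = denoiser(x_t, t, context=None)
        x0_hat = x0_uncond + guidance_scale * (x0_cond - x0_uncond)

        # Naive update towards the clean estimate; a real DDPM/DDIM sampler
        # would follow the noise schedule and may re-inject noise here.
        x_t = x0_hat + (x_t - x0_hat) * (t / num_steps)

    return x_t  # predicted clean next-concept embedding
```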
There’s A Quantized LCM As Well
Lastly, another variant of LCM is trained where, instead of using the continuous sentence embeddings from the SONAR space, these are quantized or converted into a discrete form.
A technique called Residual Vector Quantization (RVQ) is used for this process.
Two models, Quant-LCM-d and Quant-LCM-c, are then trained.
The former predicts discrete units, while the latter predicts continuous residuals to refine embeddings during the generation process.
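Residual Vector Quantization itself is easy to sketch: each codebook quantizes whatever the previous codebooks left over. The snippet below is a generic illustration with random codebooks and nearest-neighbour lookup, not the Quant-LCM implementation:

```python
import torch

def rvq_encode(x: torch.Tensor, codebooks: list[torch.Tensor]):
    """Quantize a 1-D embedding `x` with a stack of codebooks.

    Each codebook has shape (codebook_size, dim). Returns the chosen code
    indices and the reconstructed embedding."""
    residual = x.clone()
    indices: list[int] = []
    reconstruction = torch.zeros_like(x)
    for codebook in codebooks:
        # Pick the codebook entry closest to the current residual.
        dists = torch.cdist(residual.unsqueeze(0), codebook).squeeze(0)
        idx = int(dists.argmin())
        indices.append(idx)
        reconstruction += codebook[idx]
        residual -= codebook[idx]        # the next codebook refines what is left
    return indices, reconstruction

# Toy usage: 3 codebooks of 256 entries each for a 1024-dim SONAR-like vector.
codebooks = [torch.randn(256, 1024) for _ in range(3)]
ids, approx = rvq_encode(torch.randn(1024), codebooks)
```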
Which LCM Variants Perform The Best?
All four LCM variants with 1.6 billion trainable parameters are first pre-trained on the Fineweb-edu dataset.
Performance on Pre-Training
When pre-training performance is evaluated, Base-LCM shows the lowest L2 reconstruction error but the poorest Contrastive Accuracy (CA) and Mutual Information (MI) compared to the Diffusion-based LCMs and Quant-LCM.
The reason?
Since many valid next-sentence continuations exist for a given context, the Base-LCM (which optimizes the MSE loss) ends up predicting an average of them, and this averaged continuation may not correspond to any meaningful embedding in the SONAR space.
Also, if you’re new to these metrics —
Contrastive Accuracy (CA) measures whether the predicted embedding is closer to the true next-sentence embedding than to other unrelated embeddings in the batch.
Mutual Information (MI) measures how well the predicted sentence aligns contextually with the preceding sentences.
Both Diffusion-based models attain the highest mutual information (MI) with no consistent performance difference between them.
This shows that the embeddings they produced effectively align with realistic text continuations.
Performance on Instruction-tuning
All pre-trained models are next fine-tuned on the Cosmopedia dataset.
Upon evaluation, Diffusion-based models outperform both Quant-LCM and Base-LCM.
A small Llama model with 24 transformer layers and 1.4 billion parameters is pre-trained and fine-tuned in similar ways and compared to the LCMs as well.
Notably, this model, called SmaLlama, still outperforms all LCM models on the given metrics, pointing towards better fluency of generated text.
How Do LCMs Compare With LLMs?
Inference Efficiency
LCMs work with Concepts rather than tokens, which leads to much shorter sequences.
Since the Attention mechanism used in Transformers has quadratic complexity in the sequence length, the computational cost and inference time for LCMs are far lower than for LLMs.
It is only for extremely short sentences (<10 tokens) that an LLM is more computationally efficient than an LCM.
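As a rough, illustrative back-of-the-envelope example: a 2,000-token document whose sentences average 20 tokens becomes a sequence of about 100 concepts. Because attention cost grows with the square of sequence length, the attention term scales like 100² = 10,000 for the LCM versus 2,000² = 4,000,000 for a token-level model over the same document, although each LCM step also pays for the fixed SONAR encode/decode and works with larger sentence embeddings.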
Short-Context Summarization Task
For this task, the Two-Tower LCM scaled to 7 billion parameters shows competitive generative performance against many instruction fine-tuned LLMs (higher ROUGE-L score), produces fewer repetitions in the generated text (lower REP-4 score), and generates more abstractive summaries (lower OVL-3 score).
On the other hand, it is less fluent than other LLMs (lower CoLA score).
Long-Context Summarization Task
When evaluated for this type of task on the LCFO dataset, the Two-Tower LCM outperforms Mistral-7B-v0.3-IT and Gemma-7B-IT (higher ROUGE-L score) for compressed summary tasks (5% and 10% of source length).
LCM also has high semantic relevance for the summaries to the source for all conditions (high SH-5 score).
Summary Expansion Task
On this task of generating longer text for a given summary, LLMs get higher ROUGE-L scores than LCMs.
LCMs generate more paraphrased content but score lower on fluency.
Zero-Shot Multi-Lingual Performance
SONAR embeddings support text in 200 languages and speech in 76 languages.
When an LCM, backed by SONAR, is evaluated on the XL-Sum dataset for multilingual abstractive summarization in 42 languages, its results are quite impressive.
Compared to Llama-3.1-8B-IT (which has been fine-tuned on different languages), the Two-Tower LCM (which has never seen training data in any language other than English) scores higher on the multilingual ROUGE-L metric in English and in other languages supported by both models.
LCMs also generalize well to low-resource languages (like Vietnamese, Hausa and Burmese) in a zero-shot setting, i.e. languages they have never seen before.
LLMs Still Remain Undefeated. But For How Long?
LLMs stand strong in these evaluations.
This is because next-sentence (Concept) prediction is a much harder task than next-token prediction: the number of possible sentences, given some context, is virtually unlimited. (Compare this to token vocabularies, which are usually in the range of 100k.)
LLMs produce logits for each token in the vocabulary and then apply a Softmax to convert these into probabilities.
The token with the highest probability is selected as the next token (if using Greedy decoding), or a token can be sampled from the resulting probability distribution (Temperature sampling).
LCMs, on the other hand, generate text in a continuous embedding space rather than choosing tokens from a vocabulary, so there is no explicit probability distribution over next Concepts to pick from.
Such a distribution can theoretically be learned using Diffusion (and that is why Diffusion-based LCMs perform better than the other variants), but these models are still not at their best.
LCMs still have a long way to go to reach the performance of current state-of-the-art LLMs, but they are really promising contenders.
It would be exciting to see how their story unfolds when blindly scaling LLMs no longer works.