‘MathPrompt’ Embarrassingly Jailbreaks All LLMs Available On The Market Today
A deep dive into how a novel LLM Jailbreaking technique called ‘MathPrompt’ works, why it is so effective, and why it needs to be patched as soon as possible to prevent harmful LLM content generation
AI safety is of utmost priority for big tech.
LLMs are being trained to produce human values-aligned responses using Supervised Fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) / Reinforcement Learning from AI Feedback (RLAIF).
They also have other safety mechanisms that prevent them from generating harmful content.
These mechanisms are stress-tested (a process called Red Teaming), and the detected vulnerabilities are regularly patched.
Despite these practices, we still haven’t reached a perfectly safe model (or even come close to it).
We have previously seen techniques like Disguise and Reconstruction, where a user sends the LLM a ‘disguised’, benign-looking prompt that is reconstructed in the next stage to elicit a harmful response from GPT-4.
We have seen other techniques, such as translating an unsafe English prompt into a low-resource language and back, which results in harmful responses from the LLM.
Or the one where simply including hundreds of fictitious harmful question-answer pairs in the prompt bypasses Claude’s safety mechanisms.
Most of these vulnerabilities have likely been patched by now, but there is another novel technique in town.
It is called ‘MathPrompt’.
And it jailbreaks all state-of-the-art LLMs popularly used today with a very high success rate.
Published on arXiv, this technique intelligently encodes harmful prompts into mathematical problems for the LLMs to solve.
And yes, the LLM solves these problems, returning a harmful response.
Surprisingly, the technique achieves an average Attack Success Rate (ASR) of 73.6% across 13 state-of-the-art LLMs, with the highest being 87.5% on Claude 3 Haiku and 85% on GPT-4o.
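(For context, Attack Success Rate here means, roughly, the fraction of harmful test prompts for which the model ends up producing the requested harmful content: ASR = successful jailbreaks ÷ total harmful prompts tested.)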
Here’s a story where we explore how the ‘MathPrompt’ technique works, why it is so effective and why it needs to be patched as soon as possible to prevent harmful content generation by the affected LLMs.
(A little disclaimer before we begin: This story is for educational purposes only and not intended to help readers exploit AI vulnerabilities. Any such unethical action is at the reader’s own risk.)
Let’s go!
How Does ‘MathPrompt’ Work?
Modern-day LLMs have remarkable mathematical capabilities.
They can understand complex multi-step quantitative word problems, work with algebraic expressions, perform symbolic reasoning and even prove mathematical theorems.
‘MathPrompt’ misuses this ability of LLMs in a two-step process.
Step 1: Representing Natural Language Prompts in Symbolic Mathematics
This step converts natural language prompts into mathematical representations while preserving their meaning, structure and relationships.
Three mathematical branches are specifically used for this purpose:
Set Theory: to represent collections and relationships between terms using symbols such as ∈ (element of), ⊆ (subset), ∩ (intersection), and ∪ (union).
Abstract Algebra: to represent structures (groups, rings, fields) and their operations to model processes.
Symbolic Logic: to represent logical relationships and conditions using symbols such as ∧ (and), ∨ (or), ¬ (not), → (implies), ∀ (for all), and ∃ (there exists).
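To get a feel for what such an encoding looks like, here is a simplified and deliberately benign sketch of my own (the paper’s actual encodings are richer and generated automatically). Take the instruction “find a way to borrow a rare book from the library archive”. In symbolic form, it might become something like:
Let A be the set of all actions available to a library visitor, and let B ⊆ A be the subset of actions that result in obtaining a rare book from the archive. Prove that ∃ g ∈ B (that is, B is non-empty), and exhibit one such g by describing the sequence of steps it consists of.
Asked to “solve” this problem, the model naturally spells out the step-by-step plan hidden inside the set definition, which is exactly the mechanism MathPrompt exploits, only with harmful instructions in place of benign ones.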
These complex mathematical prompts seem tough to produce, but worry not—the researchers simply prompted GPT-4o to do so.