An LLM With A Visual Sketchpad Can Now Smash Its Competitors Without One (Even GPT-4o)

A deep dive into the “Sketchpad” framework that enables LLMs to draw and reason via the “Visual Chain-of-Thought Prompting” approach

Aug 16, 2024

Humans have used sketching for ages as a tool for formulating ideas, communicating them and solving problems.

Think about all the cave paintings whose meaning we can still make out today.

Or the first images you created as a child, haphazardly drawing with multiple crayons on a blank canvas when you did not yet know how to speak.

Sketching somehow preserves and propagates knowledge like text never can.

This insight stuck with the researchers behind a recent pre-print on arXiv.

They introduced a framework called Sketchpad, which gives multi-modal LLMs a visual sketchpad and the tools to draw on it.

The framework allows these LLMs to draw intermediary sketches to boost their reasoning ability when prompted.

And yes, it works wonders!

Sketchpad significantly improves task performance over the same base models without sketching, yielding an average gain of 12.7% on math tasks and 8.6% on vision tasks.

Notably, when Sketchpad is used with GPT-4o, it sets a new state-of-the-art performance on all tasks, including V*Bench (80.3%), BLINK spatial reasoning (83.9%) and visual correspondence (80.8%) benchmarks.

Sketchpad enables GPT-4o to arrive at the correct solution, which it cannot achieve without Sketchpad. (Image from the original research paper)

It is also noted that human evaluations show high agreement with Sketchpad-enabled GPT-4o’s plans, with 80% matching on geometry tasks and a 92.8% validity rating on vision tasks!

Here is a story in which we deep-dive into how the Sketchpad framework works, how it unlocks new insights into the inner workings of LLMs, and how it supercharges the performance of state-of-the-art LLMs like never before.

But First, Why Do LLMs Struggle With Mathematical & Visual Tasks?

Many LLMs perform well on tasks that can be solved by pure linguistic context but inherently lack the understanding of mathematics and visuospatial data.

Mathematical tasks often require step-by-step reasoning, the ability to handle abstract concepts, and meticulous logic application.

For visual tasks, the models need to be able to recognize complex multi-dimensional objects and relate them spatially.

Often, the training data lacks such features, or the architecture of the LLMs is not good enough to understand these patterns.

Researchers have previously tried to address these problems that LLMs face in mathematical tasks with better prompting techniques. One such technique is Chain-of-Thought Prompting.

Let’s talk about it.

What Is Chain-of-Thought Prompting?

Published on arXiv in 2022, Chain-of-Thought (CoT) is a prompting technique that lets LLMs decompose a complicated reasoning task into small intermediate sub-problems.

Each of these sub-problems is tackled before the LLM gives the final answer.

Chain-of-thought prompting is conceptually similar to the Divide-and-Conquer algorithmic technique in that both methods break down complex tasks into simpler components and work on them before arriving at the final solution.

Both of these processes are similar to how the human thought process works when solving a complex problem.

Divide & Conquer visualised (Image from author’s upcoming book ‘Computer Science in 100 Images’)

CoT prompting involves crafting a prompt to guide an LLM through a series of intermediate reasoning steps before arriving at the solution.

This contrasts with the standard prompting approach, where reasoning steps are not explicitly included in the prompt.

Standard vs. Chain-of-Thought Prompting. Chain-of-thought reasoning processes are highlighted in the image. (Image from the research paper titled ‘Chain-of-Thought Prompting Elicits Reasoning in Large Language Models’ published in ArXiv)

It is seen that CoT prompting significantly improves an LLM’s performance on complex reasoning tasks, such as arithmetic, commonsense, and symbolic reasoning.

(Note that CoT described here is Few-shot CoT prompting.)
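To make the contrast concrete, here is a minimal sketch of how the two kinds of few-shot prompts could be assembled. The `Demo` class and the two helper functions are hypothetical names chosen for illustration, and the demonstration is written in the style of the tennis-ball example commonly used to explain CoT. The only difference between the two prompts is whether the worked reasoning appears before the answer.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    """One worked example shown to the model (hypothetical helper type)."""
    question: str
    reasoning: str  # the intermediate "chain of thought"
    answer: str


def standard_prompt(demos, question):
    """Few-shot prompt that shows only question/answer pairs."""
    parts = [f"Q: {d.question}\nA: The answer is {d.answer}." for d in demos]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def cot_prompt(demos, question):
    """Few-shot prompt that also shows the reasoning before each answer."""
    parts = [
        f"Q: {d.question}\nA: {d.reasoning} The answer is {d.answer}."
        for d in demos
    ]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


demo = Demo(
    question=(
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?"
    ),
    reasoning=(
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
        "6 tennis balls. 5 + 6 = 11."
    ),
    answer="11",
)
new_question = (
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?"
)

print(cot_prompt([demo], new_question))
```

The resulting string would then be sent to whichever LLM is being used; that call is left out here because it depends on the provider.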

Some examples of triplets of Input, Chain of thought reasoning, Outputs for Arithmetic, Commonsense, and Symbolic reasoning benchmarks. Chains of thought reasoning processes are highlighted in the image. (Image from the research paper titled ‘Chain-of-Thought Prompting Elicits Reasoning in Large Language Models’ published in ArXiv)

A simpler variant is Zero-shot Chain-of-Thought prompting, which needs no worked examples and simply involves adding “Let’s think step by step” to the original prompt.

Examples of different types of Prompting approaches (Image from the research paper titled ‘Large Language Models are Zero-Shot Reasoners’ published in ArXiv)
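In code, Zero-shot CoT is little more than string concatenation. The sketch below uses hypothetical function names. The second helper reflects the two-stage idea of first eliciting the reasoning and then asking for the final answer; the exact answer-extraction phrase is an assumption and can be swapped for whatever suits the task.

```python
TRIGGER = "Let's think step by step."


def zero_shot_cot_prompt(question: str) -> str:
    """Stage 1: append the trigger phrase so the model writes out its reasoning."""
    return f"Q: {question}\nA: {TRIGGER}"


def answer_extraction_prompt(
    question: str,
    reasoning: str,
    extraction_phrase: str = "Therefore, the answer is",  # assumed wording
) -> str:
    """Stage 2: feed the generated reasoning back and ask for the final answer."""
    return f"{zero_shot_cot_prompt(question)} {reasoning}\n{extraction_phrase}"


print(zero_shot_cot_prompt("What is 17 + 25?"))
```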

Later in 2022, more work was done to devise another approach called Automatic Chain-of-Thought Prompting.

This approach automatically constructs demonstrations for Chain-of-Thought prompting in LLMs rather than manually doing this as in previous approaches, using diversity-based Clustering and Zero-shot (“Let’s think step by step”) prompts.

Overview of Automatic Chain-of-Thought prompting (Image from the research paper titled ‘Automatic Chain of Thought Prompting in Large Language Models’ published in ArXiv)

But What About Improving Performance On Visual Tasks?

Similar to CoT prompting, researchers have previously explored decomposing complex vision tasks into smaller and simpler sub-steps that can be solved using specialized vision tools.

Two such research works include VISPROG and ViperGPT.

These use LLMs to generate Python code that invokes the required vision tools to solve a sub-problem.

Given visual input and a query, ViperGPT synthesizes a program and executes it with the Python interpreter to produce the final answer. (Image from the research paper titled ‘ViperGPT: Visual Inference via Python Execution for Reasoning’ published in ArXiv)

Use of VISPROG for visual reasoning (Image from the research paper titled ‘Visual Programming: Compositional visual reasoning without training’ published in ArXiv)

Although quite admirable, these tools do not completely address the problem.

They follow a pre-defined plan and do not change it according to the intermediate visual cues produced. This frequently causes them to produce incorrect results.

Researchers have thus combined and improved upon these ideas to create the Sketchpad framework.

Let’s talk about it next.

Here Comes “Sketchpad”

Borrowing insights from previous research work, the Sketchpad framework enables multi-modal LLMs to draw sketches.

These sketches allow these models to reason during their intermediate steps to answer a query.

Think of it like Chain-of-Thought prompting but with intermediate visual reasoning steps, or call it “Visual Chain-of-Thought prompting”.

The framework can be used on any multi-modal LLM out of the box and requires no fine-tuning of the baseline model.

It is built upon the open-source AutoGen framework that allows developers to build LLM applications via multiple agents that can converse and coordinate with each other to accomplish tasks.

This is how it works with an LLM interactively:

1. Given a multi-modal query and the current context, the base LLM analyses it and generates a plan to address it (analogous to Thought). This query/prompt includes the Python function signatures and docstrings for the modules that can be used to solve the task.
2. Based on this plan, the LLM decides on an Action to take. This step involves generating Python code to execute its planned action.
3. The Sketchpad framework provides the environment (the tools and functions) to run the LLM-generated code, which calls external libraries (such as matplotlib or networkx for plotting and other specialist vision models for segmentation, marking, masking, labelling and more).
4. Sketchpad’s environment then returns a new Observation to the LLM that updates its context.

This interaction continues till the LLM determines that it has enough information from its context to answer the given query.
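The Thought, Action and Observation cycle described above can be sketched as a small loop. This is not Sketchpad's actual code: `sketchpad_loop`, `run_action` and `scripted_model` are hypothetical names, the "model" is a scripted stand-in for a multi-modal LLM, and the observation is plain text rather than a rendered image. It only illustrates how each observation is appended to the context before the model is asked again.

```python
import contextlib
import io


def run_action(code: str, tools: dict) -> str:
    """Run model-written code and return whatever it prints as the observation."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        # NOTE: a real system would sandbox this; exec on untrusted code is unsafe.
        exec(code, dict(tools))
    return buffer.getvalue().strip()


def sketchpad_loop(model, query, tools, max_turns=5):
    """Alternate Thought -> Action -> Observation until the model answers."""
    context = [f"QUERY: {query}"]
    for _ in range(max_turns):
        step = model(context)  # the model sees the full context so far
        context.append(f"THOUGHT: {step['thought']}")
        if "answer" in step:  # the model decides it has enough information
            return step["answer"], context
        context.append(f"ACTION:\n{step['action']}")
        context.append(f"OBSERVATION: {run_action(step['action'], tools)}")
    return None, context


def scripted_model(context):
    """Stand-in for an LLM: act once, then answer from the observation."""
    prefix = "OBSERVATION: "
    observations = [c for c in context if c.startswith(prefix)]
    if not observations:
        return {
            "thought": "Count the edges by calling the tool.",
            "action": "print(count_edges([(0, 1), (1, 2)]))",
        }
    return {
        "thought": "The observation gives the count.",
        "answer": observations[-1][len(prefix):],
    }


tools = {"count_edges": len}  # hypothetical tool exposed to the generated code
answer, trace = sketchpad_loop(scripted_model, "How many edges?", tools)
print(answer)
```

The key design point is that the plan is not fixed up front: the model is called again after every observation, so it can change course based on what the previous action produced.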

Performance On Mathematical Problem-Solving Tasks

Sketchpad-integrated LLMs are evaluated on different mathematical tasks, and the results are shown below.

Geometry Problems

Problems from the Geometry3K dataset are used for this evaluation.

A problem example is shown below.

Given a query to solve a Geometry problem, Sketchpad uses the ‘matplotlib’ library to visualise the intermediate reasoning step required to reach a solution (Image from the original research paper)

Mathematical Function Solving

Problems from the IsoBench datasets are used for this evaluation.

1. Classifying Parity: To determine if a function is even, odd, or neither

The following prompt is given to the model for this task.

The prompt given to the LLM for solving the Math Parity task (Image from the original research paper)

The intermediate thought, action and observation steps are shown below.

Intermediate steps for solving the Math Parity task (Image from the original research paper)
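For reference, the property being tested can be checked numerically. The sketch below is not how Sketchpad solves the task (it plots the function and lets the model inspect the picture); it simply spells out the definitions f(-x) = f(x) for even and f(-x) = -f(x) for odd on a handful of sample points. `classify_parity` and its sample points are hypothetical, and a sampling check is evidence rather than proof.

```python
import math


def classify_parity(f, xs=(0.5, 1.0, 2.0, 3.5), tol=1e-9):
    """Classify f as 'even', 'odd' or 'neither' from a few sample points."""
    is_even = all(math.isclose(f(-x), f(x), abs_tol=tol) for x in xs)
    is_odd = all(math.isclose(f(-x), -f(x), abs_tol=tol) for x in xs)
    if is_even:
        return "even"  # the zero function is both; it is reported as even here
    if is_odd:
        return "odd"
    return "neither"


print(classify_parity(lambda x: x ** 2))      # symmetric about the y-axis
print(classify_parity(lambda x: x ** 3))      # symmetric about the origin
print(classify_parity(lambda x: x ** 2 + x))  # no symmetry
```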

2. Identifying Convexity/ Concavity: To determine whether a function is Convex or Concave

The following prompt is given to the model for this task.

The prompt given to the LLM for solving the Math Convexity task (Image from the original research paper)

The intermediate steps for this task are not shown in the original research paper.
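As a rough illustration of what the task asks, convexity can be probed with second differences on a grid: for a convex function, f(x - h) - 2 f(x) + f(x + h) is never negative, and for a concave one it is never positive. The function name, interval and step count below are assumptions made for this sketch, and a grid check on one interval is only a heuristic, not a proof.

```python
import math


def classify_convexity(f, lo=-3.0, hi=3.0, steps=60, tol=1e-9):
    """Classify f on [lo, hi] as 'convex', 'concave' or 'neither'."""
    h = (hi - lo) / steps
    second_diffs = [
        f(lo + (i - 1) * h) - 2 * f(lo + i * h) + f(lo + (i + 1) * h)
        for i in range(1, steps)
    ]
    if all(d >= -tol for d in second_diffs):
        return "convex"  # straight lines satisfy both tests and land here
    if all(d <= tol for d in second_diffs):
        return "concave"
    return "neither"


print(classify_convexity(lambda x: x ** 2))
print(classify_convexity(lambda x: -(x ** 2)))
print(classify_convexity(lambda x: x ** 3))
```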

Graph Problem Solving

Problems from the IsoBench datasets are used for this evaluation.

1. Graph Connectivity: To figure out whether there exists a path between two vertices in a graph

The following prompt is given to the model for this task.

The prompt given to the LLM for solving the Graph Connectivity task (Image from the original research paper)

The intermediate thought, action and observation steps are shown below.

Intermediate steps for solving the Graph Connectivity task (Image from the original research paper)
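For comparison, the classical way to answer a connectivity query is a breadth-first search. The sketch below assumes the graph arrives as an adjacency matrix, and `has_path` is a hypothetical helper. Sketchpad itself takes a different route: it draws the graph so the model can reason over the picture.

```python
from collections import deque


def has_path(adj, source, target):
    """Breadth-first search over an adjacency matrix."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbour, connected in enumerate(adj[node]):
            if connected and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


# Vertices 0-1-2 form a chain; vertex 3 is isolated.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
print(has_path(adj, 0, 2))
print(has_path(adj, 0, 3))
```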

2. Graph Maximum Flow: To determine the maximum flow that can be sent through a network from a source to a sink vertex, considering the capacity constraints on the edges

The following prompt is given to the model for this task.

The prompt given to the LLM for solving the Graph Maximum Flow task (Image from the original research paper)
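To make the maximum flow definition concrete, here is a compact sketch of the Edmonds-Karp algorithm (repeatedly find a shortest augmenting path with breadth-first search and push flow along it). It assumes the network is given as a capacity matrix, and `max_flow` is a hypothetical helper written for this article, not something taken from the paper.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow on a capacity matrix."""
    if source == sink:
        return 0
    n = len(capacity)
    residual = [row[:] for row in capacity]  # remaining capacity per edge
    total = 0
    while True:
        # Breadth-first search for a path with spare capacity.
        parent = [-1] * n
        parent[source] = source
        queue = deque([source])
        while queue and parent[sink] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and residual[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[sink] == -1:
            return total  # no augmenting path is left
        # Find the tightest edge on the path.
        bottleneck = float("inf")
        v = sink
        while v != source:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push flow along the path and open up the reverse edges.
        v = sink
        while v != source:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Edges: 0->1 (3), 0->2 (2), 1->2 (1), 1->3 (2), 2->3 (3)
capacity = [
    [0, 3, 2, 0],
    [0, 0, 1, 2],
    [0, 0, 0, 3],
    [0, 0, 0, 0],
]
print(max_flow(capacity, 0, 3))
```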

3. Graph Isomorphism Task: To figure out if two graphs are structurally equivalent

The prompt given to the model for this task is shown below.

The prompt given to the LLM for solving the Graph Isomorphism task (Image from the original research paper)

The original research paper does not show the intermediate steps for either of the above two tasks.
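To pin down what "structurally equivalent" means in the isomorphism task, the sketch below tries every relabelling of the vertices of one graph and checks whether it reproduces the other graph's adjacency matrix. `are_isomorphic` is a hypothetical helper, it assumes undirected graphs given as adjacency matrices, and the brute-force search grows factorially, so it is only sensible for very small graphs.

```python
from itertools import permutations


def are_isomorphic(adj_a, adj_b):
    """Brute-force isomorphism test for small undirected graphs."""
    n = len(adj_a)
    if n != len(adj_b):
        return False
    # Cheap filter: isomorphic graphs share the same degree sequence.
    if sorted(map(sum, adj_a)) != sorted(map(sum, adj_b)):
        return False
    for perm in permutations(range(n)):
        if all(
            adj_a[i][j] == adj_b[perm[i]][perm[j]]
            for i in range(n)
            for j in range(n)
        ):
            return True
    return False


path_a = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]    # 0 - 1 - 2
path_b = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]    # 1 - 0 - 2
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]

print(are_isomorphic(path_a, path_b))
print(are_isomorphic(path_a, triangle))
```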

Game Strategy Formulation & Analysis

Problems from the IsoBench datasets are used for this evaluation, where the task is to find the outcome of a chess game.

The following prompt is given to the LLM, which uses Python’s chess library to draw chess boards based on the Forsyth–Edwards Notation.

The prompt given to an LLM to analyze a chess game outcome (Image from the original research paper)

Again, the original research paper does not show the intermediate steps for this task.
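To show what is encoded in a Forsyth–Edwards Notation (FEN) string, here is a small standard-library sketch that expands its piece-placement field into an 8x8 grid. In the paper the drawing is done with the Python chess library; `fen_to_board` below is a hypothetical stand-in that only prints text, with `.` marking an empty square.

```python
def fen_to_board(fen):
    """Expand the piece-placement field of a FEN string into 8 rows of 8 squares."""
    placement = fen.split()[0]  # the first field holds the piece positions
    rows = []
    for rank in placement.split("/"):  # ranks are listed from 8 down to 1
        row = []
        for ch in rank:
            if ch.isdigit():
                row.extend("." * int(ch))  # a digit means that many empty squares
            else:
                row.append(ch)  # uppercase = white piece, lowercase = black piece
        if len(row) != 8:
            raise ValueError(f"rank {rank!r} does not describe 8 squares")
        rows.append(row)
    if len(rows) != 8:
        raise ValueError("FEN must describe 8 ranks")
    return rows


start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
for row in fen_to_board(start):
    print(" ".join(row))
```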

Results For Mathematical Problem-Solving Tasks

Sketchpad leads to large performance gains for the GPT-4 models across almost all tasks, allowing them to outperform all other baseline models.

Accuracy scores on mathematical problem-solving tasks (Image from the original research paper)

Performance On Computer Vision Tasks

Sketchpad-integrated LLMs are evaluated on different complex visual reasoning tasks based on the V*Bench, BLINK and MMVP benchmarks.

A few examples of these tasks are shown below.

Examples of Sketchpad applied to Computer Vision tasks (Image from the original research paper)

LLMs are prompted to use different specialist vision tools to sketch and manipulate the given images to solve these tasks, as displayed below.

Beginning of the prompt given to the LLM for a computer vision task (Image from the original research paper)

Functions in the prompt given to the LLM for a computer vision task (Image from the original research paper)

End of the prompt given to the LLM for a computer vision task (Image from the original research paper)

Results For Visual Reasoning Tasks

It is found that Sketchpad enhances the performance of GPT-4 Turbo and GPT-4o, outshining other baseline models to reach a new state-of-the-art performance on all tasks.

Accuracy scores on Computer Vision tasks (Image from the original research paper)

Cost Of Running Sketchpad

Sketchpad’s per-sample cost using GPT-4o ranges from $0.011 to $0.133.

The cost is higher for visual tasks than for mathematical tasks due to increased token usage.

Cost of running Sketchpad on different mathematical problem-solving and computer vision tasks (Image from the original research paper)

Although Sketchpad increases the computational resources required to answer queries, its results are mind-blowing, and this research could be a significant step towards more human-like multi-modal intelligence in LLMs.

What are your thoughts about it? Let me know in the comments below.
