Kolmogorov-Arnold Networks (KANs) Are Being Used To Boost Graph Deep Learning Like Never Before
A deep dive into how Graph Kolmogorov-Arnold Networks (GKANs) are improving Graph Deep Learning to surpass traditional approaches
KANs have gained a lot of attention since they were published in April 2024.
They are being used to solve several machine-learning problems that previously used Multi-layer Perceptrons (MLPs), and their results have been impressive.
A team of researchers recently used KANs on Graph-structured data.
They called this new neural network architecture — Graph Kolmogorov-Arnold Networks (GKANs).
And how did it go, you ask?
They found that GKANs achieve higher accuracy in semi-supervised learning tasks on a real-world graph dataset (Cora) than the traditional ML models used for Graph Deep Learning, i.e. Graph Convolutional Networks (GCNs).
This is a big step for KANs!
Here is a story where we dive deep into GKANs, learn how they are used with graph-structured data, and discuss how they surpass traditional approaches in Graph Deep Learning.
But First, What Is Graph Deep Learning?
Graphs are mathematical structures that consist of nodes (or vertices) and edges (or links) connecting these nodes.
Some examples of real-world data that is structured in the form of graphs include:
Social network connections (Users as nodes and relationships as edges)
Recommendation systems (Items as nodes and user interactions as edges)
Chemical Molecules and compounds (Atoms as nodes and bonds as edges)
Biological molecules such as Proteins (Amino acids as nodes and bonds as edges)
Transportation networks like Roadways (Intersections as nodes and pathways as edges)
Graph Deep Learning is a set of methods developed to learn from such graph-structured data and solve problems based on this learning.
Some of these Graph Learning problems involve:
Graph classification (labelling a graph according to its properties)
Node classification (predicting the label of a new node)
Link prediction (predicting the existence of relations/edges between nodes)
Graph generation (creating new graphs based on existing ones)
Community detection (identifying clusters of densely connected nodes within a graph)
Graph Embedding generation (generating low-dimensional representations of high-dimensional graphs)
Graph clustering (grouping similar graph nodes together)
Graph anomaly detection (figuring out abnormal nodes or edges that do not match the expected pattern in a graph)
Traditionally, these problems have been solved using Graph Neural Networks (GNNs) and their variants (notably Graph Convolutional Networks), which use MLPs at their core.

Let’s explore Graph Convolutional Networks (GCNs) in a bit more detail.
What Are Graph Convolutional Networks?
A Graph Convolutional Network (GCN) combines a graph’s node features with its topology (or how the nodes are connected in space). This allows it to effectively capture the dependencies and relationships in the graph.
In other words, GCNs are based on the assumption that the node labels y are mathematically dependent on both the node features X and the graph's structure (i.e. its adjacency matrix A).
This can be mathematically expressed as:
y = f(X, A)
A multi-layer GCN updates the node representations by aggregating the information from neighbouring nodes using a layer-wise propagation rule:

H(l+1) = σ( D̃^(-1/2) Ã D̃^(-1/2) H(l) W(l) )
where:
Ã = A + I is the augmented adjacency matrix, i.e. the graph's adjacency matrix with a self-connection added for each node. It is the sum of the graph's adjacency matrix A and the identity matrix I.
D̃ is the diagonal degree matrix of Ã, where each diagonal element represents the degree of node i in the augmented graph.
D̃^(-1/2) Ã D̃^(-1/2) symmetrically normalizes Ã, making sure that each node's influence is appropriately scaled by its degree. This normalized adjacency matrix is usually denoted Â.
H(l) is the matrix of node features at layer l.
H(0) represents the initial node features (i.e. X).
W(l) is the trainable weight matrix at layer l.
σ represents an activation function (e.g. ReLU).
A simple two-layer GCN's forward propagation (used for node classification) can be expressed as follows:

Z = softmax( Â ReLU( Â X W(0) ) W(1) )
where:
Â is the normalized adjacency matrix.
X is the input feature matrix.
W(0) and W(1) are the weight matrices for the first and second layers, respectively. These weights are optimized using Gradient Descent.

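To make the formula concrete, here is a minimal NumPy sketch of this two-layer forward pass. The toy graph, feature dimensions, and weight initialization below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def normalize_adjacency(A):
    """Compute Â = D̃^(-1/2) Ã D̃^(-1/2), where Ã = A + I."""
    A_tilde = A + np.eye(A.shape[0])                  # add self-connections
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))   # diagonal of D̃^(-1/2)
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def gcn_forward(A_hat, X, W0, W1):
    """Z = softmax( Â ReLU( Â X W(0) ) W(1) ), softmax taken row-wise."""
    H = np.maximum(A_hat @ X @ W0, 0)                 # first layer + ReLU
    logits = A_hat @ H @ W1                           # second layer
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0, 1],                        # a toy 5-node graph
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(5, 8))                           # 5 nodes, 8 input features
W0 = 0.1 * rng.normal(size=(8, 16))
W1 = 0.1 * rng.normal(size=(16, 3))                   # 3 output classes
Z = gcn_forward(normalize_adjacency(A), X, W0, W1)
print(Z.shape)                                        # (5, 3): class probabilities per node
```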
Now that we know about GCNs, let's move on to learning about KANs.
Next, What Are KANs?
Kolmogorov-Arnold Networks (KANs) are a novel and innovative neural network architecture based on the Kolmogorov-Arnold representation theorem.
They are a promising alternative to the currently popular MLPs, which are based on the Universal Approximation Theorem.
The core idea behind KANs is to use learnable univariate activation functions (shaped as a B-Spline) on the edges and simple summations on the nodes of a neural network.
This contrasts with MLPs that use learnable weights on the edges while having a fixed activation function on the neural network nodes.
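As a rough illustration of the idea, the sketch below implements a single KAN-style layer in which every edge carries its own learnable univariate function, followed by plain summation at the nodes. For brevity it uses a small polynomial basis per edge instead of the B-splines used in actual KANs, and all names and shapes here are my own.

```python
import numpy as np

def kan_layer(x, coeffs):
    """One toy KAN layer: a learnable univariate function per edge,
    followed by plain summation at each output node.
    x: (in_dim,) input vector.
    coeffs: (out_dim, in_dim, n_basis) per-edge polynomial coefficients."""
    n_basis = coeffs.shape[-1]
    # Basis [1, x_j, x_j^2, ...] evaluated for every input feature x_j.
    basis = np.stack([x**p for p in range(n_basis)], axis=-1)  # (in_dim, n_basis)
    # phi_ij(x_j) = sum_p coeffs[i, j, p] * x_j^p, then sum over j at node i.
    return np.einsum('oip,ip->o', coeffs, basis)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                      # 4 input features
coeffs = 0.1 * rng.normal(size=(3, 4, 4))   # 3 outputs, one cubic per edge
print(kan_layer(x, coeffs))                 # (3,) output after node summation
```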

When compared with MLPs, KANs:
Lead to smaller computational graphs
Are more parameter-efficient and accurate
Converge faster and achieve lower losses
Have steeper scaling laws
Are highly interpretable
On the downside, given the same number of parameters, they take longer to train than MLPs.
The Birth Of GKANs
Considering the advantages that KANs offer, researchers devised a novel hybrid architecture called Graph Kolmogorov-Arnold Networks (GKANs) that extends KANs to graph-structured data.
They aimed to find out if GKANs could effectively learn from both labelled and unlabeled data in a semi-supervised setting and outperform traditional graph learning methods.
The team developed two GKAN architectures, which are described below.
GKAN Architecture 1: Activations After Summation
In this architecture, the learnable univariate activation functions are applied to the aggregated node features after the summation step.
In other words, the node embeddings are first aggregated using the normalized adjacency matrix, and then they are passed through the KAN layer.
The layer-wise propagation rule for GKAN Architecture 1 is:

H(l+1) = KANLayer( Â H(l) )

where:
H(l) and H(l+1) represent the node feature matrices at layers l and l+1, respectively.
Â is the normalized adjacency matrix.
The KANLayer operation applies learnable univariate activation functions (B-splines) to the aggregated node features Â H(l).
The forward propagation model for the architecture (with two layers, analogous to the GCN example above) is expressed as:

Z = softmax( KANLayer₂( Â KANLayer₁( Â X ) ) )
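Here is a minimal sketch of this rule, with a toy polynomial kan_layer standing in for the paper's B-spline KAN layer (all names and shapes are illustrative assumptions, not the official implementation):

```python
import numpy as np

def kan_layer(H, coeffs):
    """Toy per-edge polynomial stand-in for a B-spline KAN layer.
    H: (n_nodes, in_dim); coeffs: (out_dim, in_dim, n_basis)."""
    basis = np.stack([H**p for p in range(coeffs.shape[-1])], axis=-1)
    return np.einsum('oip,nip->no', coeffs, basis)

def gkan1_layer(A_hat, H, coeffs):
    """Architecture 1: aggregate with Â first, then apply the KAN layer."""
    return kan_layer(A_hat @ H, coeffs)
```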
GKAN Architecture 2: Activations Before Summation
In this architecture, the learnable univariate activation functions are applied to the node features before the aggregation step.
In other words, the node features are first passed through the KAN layer and then aggregated using the normalized adjacency matrix.
The layer-wise propagation rule for GKAN Architecture 2 is:

H(l+1) = Â KANLayer( H(l) )

where:
H(l) and H(l+1) represent the node feature matrices at layers l and l+1, respectively.
Â is the normalized adjacency matrix.
The KANLayer operation applies learnable univariate activation functions (B-splines) to each element of the input node features H(l).
The forward propagation model for the architecture (again with two layers) is expressed as:

Z = softmax( Â KANLayer₂( Â KANLayer₁( X ) ) )
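The corresponding sketch for Architecture 2 simply swaps the order of the two operations, using the same toy kan_layer stand-in as above:

```python
import numpy as np

def kan_layer(H, coeffs):
    """Same toy polynomial stand-in for a B-spline KAN layer as above."""
    basis = np.stack([H**p for p in range(coeffs.shape[-1])], axis=-1)
    return np.einsum('oip,nip->no', coeffs, basis)

def gkan2_layer(A_hat, H, coeffs):
    """Architecture 2: apply the KAN layer first, then aggregate with Â."""
    return A_hat @ kan_layer(H, coeffs)
```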
The Performance Of GKANs
GKANs vs. GCN
Both GKAN architectures were first trained on the Cora dataset.
The Cora dataset is a citation network that consists of documents as nodes and the citation links between these documents as edges.
The dataset contains 7 document classes, with 1,433 features per document.
Next, their performance was compared to that of a conventional GCN with a comparable number of parameters, over both training and test data, using subsets of the first 100 and the first 200 of the 1,433 available features.
And the results were quite incredible!
On the first 100 features of the dataset, both GKAN architectures achieved higher accuracy than the GCN. Notably, GKAN Architecture 2 achieved 61.76% accuracy compared to 53.5% for the GCN.

Similarly, both GKAN architectures achieved higher accuracy than the GCN on the first 200 features of the dataset, with the GKAN Architecture 2 achieving 67.66% accuracy compared to 61.24% for GCN.

Training and test accuracy plots showed that GKANs achieved higher accuracy throughout both the training and testing phases.

It was also noted that the GKAN architectures showed a sharper decrease in loss values during training and required fewer epochs to converge.

Influence Of Parameters on GKANs
Researchers also evaluated how different parameters impacted the performance of GKANs.
These parameters were:
k: the degree of the polynomial in the spline functions
g: the grid size for the spline functions
h: the size of the hidden layers in the network
It was found that the following settings led to the most effective GKANs:
Lower polynomial degrees (k = 1 out of {1, 2, 3})
Intermediate grid sizes (g = 7 out of {3, 7, 11})
Moderate hidden layer sizes (h = 12 out of {8, 12, 16})
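For intuition on how k and g interact, here is a small SciPy sketch of a single univariate B-spline activation. The knot placement and input range are my own assumptions; it mainly shows that k and g jointly determine the number of learnable coefficients per activation (g + k here):

```python
import numpy as np
from scipy.interpolate import BSpline

k, g = 1, 7                           # the best-performing degree and grid size above
knots = np.linspace(-1, 1, g + 1)     # uniform grid over an assumed input range
t = np.r_[[knots[0]] * k, knots, [knots[-1]] * k]   # knot vector padded at the ends
n_coef = len(t) - k - 1               # learnable coefficients per activation: g + k
c = np.random.default_rng(0).normal(size=n_coef)    # would be trained in practice
phi = BSpline(t, c, k)                # one learnable univariate activation
print(n_coef, float(phi(0.3)))        # 8 coefficients; a sample activation value
```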
Training Time
Although GKANs showed higher accuracy and faster convergence, the researchers noted that their training process was relatively slow and will require future optimization.
KANs have opened up a new avenue for improved graph learning and could also be a promising alternative for other graph learning approaches (including Graph Autoencoders, Graph Transformers, and more) that use MLPs at their core.
What are your thoughts on them? Have you used KANs in your projects yet? Let me know in the comments below!
Further Reading
Research paper titled ‘GKAN: Graph Kolmogorov-Arnold Networks’ on ArXiv
Software implementation of GKANs on GitHub (yet to be publicly released by the research team)
GitHub repository featuring a curated list of projects using Kolmogorov-Arnold Networks (KANs)