Vision GNN, A Computer Vision Model Using Graph Structure

GNN 06/06/2023

3 main points
✔️ Proposal of a computer vision model "Vision GNN (ViG)" that represents images as a graph structure
✔️ Considers image patches as nodes and constructs a graph by connecting close patches to represent irregular and complex objects
✔️ Experiments on image recognition and object detection demonstrate the superiority of the proposed ViG architecture.

Vision GNN: An Image is Worth Graph of Nodes
written by Kai Han, Yunhe Wang, Jianyuan Guo, Yehui Tang, Enhua Wu
(Submitted on 1 Jun 2022 (v1), last revised 4 Nov 2022 (this version, v3))
Comments: NeurIPS 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Convolutional Neural Networks (CNNs) have long been the mainstay of computer vision and are used for various tasks such as image classification, object detection, and semantic segmentation. Since 2020, the Transformer has been introduced to computer vision, and many variations have been proposed, including pyramid architectures, local attention, and position encoding. The introduction of the Transformer into computer vision has also inspired MLP-based vision models.

One of the fundamental tasks of computer vision is to recognize objects in an image. CNNs treat an image as a regular grid of pixels, while Transformers and MLPs treat it as a sequence of patches. However, objects are usually irregular in shape, so the grid and sequence structures used in these networks are redundant and inflexible for representing them.

Vision GNN, introduced here, views the image as a graph structure, which allows the objects in the image to be processed flexibly and effectively.

Vision GNN

Vision GNN is a model that represents images as graph data and uses graph neural networks for visual tasks. The image is divided into several patches, which are treated as nodes. By building a graph on these nodes, irregular and complex objects can be better represented.

We will now explain how to convert images into graphs and the Vision GNN (hereafter referred to as ViG) architecture for learning visual representations.

ViG Block

Graph Structure of Image

For an image of size H × W × 3, the image is divided into N patches. Converting each patch into a feature vector x_i ∈ R^D yields X = [x_1, x_2, ..., x_N], where D is the feature dimension. These features can be regarded as an unordered set of nodes V = {v_1, v_2, ..., v_N}. For each node v_i (i = 1, 2, ..., N), we find its K nearest neighbors N(v_i) and add an edge directed from v_j to v_i for every v_j ∈ N(v_i).
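The construction above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the toy image has a single channel, the raw pixel values of each patch stand in for the learned feature vector x_i, neighbors are found by Euclidean distance, and a node is not counted as its own neighbor. The helper names `patchify` and `knn_graph` are hypothetical and are not taken from the paper's code.

```python
import math

def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

def knn_graph(features, k):
    """For each node i, return the indices of its k nearest nodes."""
    neighbors = []
    for i, x_i in enumerate(features):
        # Distance from node i to every other node (self excluded by assumption).
        dists = [(math.dist(x_i, x_j), j)
                 for j, x_j in enumerate(features) if j != i]
        neighbors.append([j for _, j in sorted(dists)[:k]])
    return neighbors

# Toy 4x4 single-channel image: left half dark, right half bright.
image = [[0, 0, 9, 9],
         [0, 1, 9, 8],
         [0, 0, 9, 9],
         [1, 0, 8, 9]]

X = patchify(image, patch=2)   # N = 4 nodes, each a 4-dimensional vector
edges = knn_graph(X, k=1)      # edges[i] lists every j with an edge v_j -> v_i
```

In this toy case the two dark patches are linked to each other, and so are the two bright patches, because the neighbors are chosen by feature similarity rather than by a fixed grid position.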

Once the image is represented as graph data, a GNN can be used to extract its representation. The advantages of representing images as graphs are as follows:

Graphs are generalized data structures, and grids and sequences can be viewed as special cases of graphs.
Graphs can model complex objects more flexibly than grids and sequences.
An object can be viewed as a composition of parts (e.g., a human being can be roughly divided into head, upper body, arms, and legs), and a graph structure can build connections between those parts.
Advances in GNN research can be transferred to vision tasks.

Graph-level processing

The graph convolution layer allows information to be exchanged between nodes by aggregating the features of neighboring nodes. It consists of two steps: an aggregation operation, which computes a node's representation from the features of its neighbors, and an update operation, which further merges the aggregated features.

Here, max-relative graph convolution is adopted because of its convenience and efficiency. These graph-level processes can be expressed as X' = GraphConv(X).
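As a rough sketch of what GraphConv does, the code below implements max-relative aggregation as we understand it: each node keeps its own feature x_i and concatenates the element-wise maximum of the differences x_j - x_i over its neighbors, and a linear update then maps the concatenated vector back to the feature dimension. Bias terms and the multi-head update are omitted, and the weight matrix `W` is a hypothetical hand-picked example rather than a learned parameter.

```python
def max_relative_aggregate(X, neighbors):
    """Return [x_i, max_j (x_j - x_i)] for every node i."""
    out = []
    for x_i, nbrs in zip(X, neighbors):
        # Element-wise maximum of the relative features over the neighbors.
        rel = [max(X[j][d] - x_i[d] for j in nbrs) for d in range(len(x_i))]
        out.append(list(x_i) + rel)
    return out

def update(H, W):
    """Linear update: multiply each aggregated row (length 2D) by W (2D x D)."""
    return [[sum(h[r] * W[r][c] for r in range(len(h)))
             for c in range(len(W[0]))]
            for h in H]

def graph_conv(X, neighbors, W):
    """X' = GraphConv(X): aggregation followed by update."""
    return update(max_relative_aggregate(X, neighbors), W)

X = [[1.0, 0.0], [3.0, 1.0], [0.0, 4.0]]   # three nodes, D = 2
neighbors = [[1, 2], [0], [0, 1]]          # neighbor indices per node
# Hypothetical weights that simply add x_i and the max-relative part.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

X_new = graph_conv(X, neighbors, W)
```

With this particular `W`, each output row equals the element-wise maximum of the neighbors' features, which makes the effect of the aggregation easy to check by hand.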

ViG Block

Conventional GCNs typically stack several graph convolution layers to extract aggregated features from graph data. However, deep GCNs suffer from over-smoothing, in which node features become less and less distinctive, and this degrades performance on visual recognition (see figure on the right).
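The effect can be illustrated with a small experiment. The sketch below is not the paper's setup: it uses plain mean aggregation instead of max-relative graph convolution, a hypothetical four-node chain graph, and a simple hand-made `diversity` measure. It only shows how repeated aggregation without any feature transformation or activation pulls node features together.

```python
def mean_aggregate(X, neighbors):
    """Replace each node feature by the mean of itself and its neighbors."""
    out = []
    for i, nbrs in enumerate(neighbors):
        group = [X[i]] + [X[j] for j in nbrs]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out

def diversity(X):
    """Average absolute deviation of node features from their mean."""
    mean = [sum(col) / len(X) for col in zip(*X)]
    return sum(abs(v - m) for x in X for v, m in zip(x, mean)) / len(X)

X = [[1.0, 0.0], [0.0, 1.0], [4.0, 2.0], [2.0, 5.0]]
neighbors = [[1], [0, 2], [1, 3], [2]]   # a simple chain graph

before = diversity(X)
for _ in range(20):                      # 20 layers, no transform or activation
    X = mean_aggregate(X, neighbors)
after = diversity(X)
```

After the loop, `after` is only a small fraction of `before`: the node features have become nearly identical. This loss of distinctiveness is the over-smoothing problem.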

To alleviate this problem, the ViG block introduces more feature transformations and nonlinear activations. A linear layer is applied before and after the graph convolution to project the node features into the same domain and increase feature diversity, and a nonlinear activation function is inserted after the graph convolution to avoid layer collapse. This upgraded module is referred to as the Grapher module.

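Given input features X, the Grapher module can be expressed as Y = σ(GraphConv(X W_in)) W_out + X, where W_in and W_out are the weights of the linear layers before and after the graph convolution, σ is the activation function, and the final + X is a residual connection (bias terms are omitted).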

To further increase the feature transformation capability and mitigate over-smoothing, a feed-forward network (FFN) is applied to each node. The FFN module is a simple multilayer perceptron with two fully connected layers.

The stack of the Grapher and FFN modules makes up the ViG Block, which is the basic building block of the network. A ViG network can then be constructed from the graph representation of the image and the proposed ViG Block.
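To make the data flow concrete, here is a minimal plain-Python sketch of one ViG Block as described above. It rests on several assumptions of ours: ReLU stands in for the activation function, both modules use a residual connection, normalization layers and bias terms are left out, the neighbor lists are passed in rather than rebuilt from the current features, and all weights are hypothetical hand-picked matrices.

```python
def matmul(A, B):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def relu(A):
    return [[max(0.0, v) for v in row] for row in A]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def graph_conv(X, neighbors, W_update):
    """Max-relative graph convolution: [x_i, max_j (x_j - x_i)] W_update."""
    agg = [list(x) + [max(X[j][d] - x[d] for j in nbrs) for d in range(len(x))]
           for x, nbrs in zip(X, neighbors)]
    return matmul(agg, W_update)

def grapher(X, neighbors, W_in, W_update, W_out):
    """Linear layer, graph convolution, activation, linear layer, residual."""
    H = graph_conv(matmul(X, W_in), neighbors, W_update)
    return add(matmul(relu(H), W_out), X)

def ffn(Y, W1, W2):
    """Two fully connected layers applied to each node, plus a residual."""
    return add(matmul(relu(matmul(Y, W1)), W2), Y)

def vig_block(X, neighbors, p):
    Y = grapher(X, neighbors, p["W_in"], p["W_update"], p["W_out"])
    return ffn(Y, p["W1"], p["W2"])

identity = [[1.0, 0.0], [0.0, 1.0]]
params = {   # hypothetical weights for D = 2 and an FFN hidden size of 4
    "W_in": identity,
    "W_update": [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]],
    "W_out": identity,
    "W1": [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]],
    "W2": [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]],
}
X = [[1.0, 0.0], [3.0, 1.0], [0.0, 4.0]]
neighbors = [[1, 2], [0], [0, 1]]

Z = vig_block(X, neighbors, params)   # same shape as X: 3 nodes, D = 2
```

Because the output has the same shape as the input, blocks like this can be stacked directly, which is what the network architectures in the next section do.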

Network Architecture

In computer vision, Transformers commonly use an isotropic architecture, while CNNs tend to use a pyramid architecture. To allow comparison with these other types of neural networks, ViG is built in both of these architectures.

Isotropic ViG

An isotropic architecture means that the main body keeps features of the same size and shape throughout the network. Three isotropic ViG architectures with different model sizes are constructed: ViG-Ti, ViG-S, and ViG-B, where Ti, S, and B stand for Tiny, Small, and Base, respectively. The number of nodes is set to N = 196. To gradually expand the receptive field, the number of neighbor nodes K is increased from 9 to 18 as the layers get deeper in these three models. The number of heads is set to h = 4 by default, and FLOPs are calculated for images with a resolution of 224 x 224 (Table 1).
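The numbers in this paragraph can be checked with a short calculation. The sketch below makes two assumptions that are ours rather than the article's: a 224 x 224 image is reduced to a 14 x 14 grid of nodes, which corresponds to an effective patch size of 16, and K grows linearly with depth. The depth of 12 is a hypothetical value chosen for illustration, and both function names are made up for this example.

```python
def num_nodes(image_size, patch_size):
    """Number of graph nodes when a square image is cut into square patches."""
    side = image_size // patch_size
    return side * side

def neighbor_schedule(depth, k_start=9, k_end=18):
    """Number of neighbors K per layer, interpolated linearly (assumption)."""
    if depth == 1:
        return [k_start]
    return [round(k_start + (k_end - k_start) * i / (depth - 1))
            for i in range(depth)]

N = num_nodes(224, 16)             # 14 * 14 nodes
ks = neighbor_schedule(depth=12)   # hypothetical depth of 12 layers
```

Here `N` comes out to the 196 nodes mentioned above, and `ks` starts at 9 in the first layer and ends at 18 in the last, so deeper layers aggregate information from a wider neighborhood.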

Table 1: Variations of the Isotropic ViG Architecture

Pyramid ViG

The Pyramid architecture takes into account the multi-scale properties of the image by extracting features of progressively smaller spatial sizes as the layer depth increases. Because the Pyramid architecture is effective in image processing, four versions of the Pyramid ViG model were constructed (Table 2).
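As a small illustration of how the graph shrinks from stage to stage, the sketch below counts the nodes per stage for a 224 x 224 input. The downsampling factors 4, 8, 16, and 32 are an assumption based on what pyramid networks commonly use; they are not stated in this article, and the function name is hypothetical.

```python
def pyramid_node_counts(height, width, strides=(4, 8, 16, 32)):
    """Nodes per stage when the feature map is downsampled by each stride."""
    return [(height // s) * (width // s) for s in strides]

counts = pyramid_node_counts(224, 224)   # one entry per stage
```

Under these assumptions the first stage works on 56 x 56 = 3136 nodes and the last stage on only 7 x 7 = 49, so the early stages capture fine detail and the later stages capture coarse, large-scale structure.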

Table 2: Pyramid ViG Series Detailed Configuration

D: feature dimension, E: ratio of hidden dimensions in FFN, K: number of neighbors in GCN, H × W: input image size

Positional encoding

To represent the position information of a node, a positional encoding vector e_i ∈ R^D is added to each node feature: x_i ← x_i + e_i.

This absolute positional encoding is applied in both the Isotropic and Pyramid architectures.

Experiments

In this study, we conducted experiments to demonstrate the effectiveness of the ViG model in image recognition and object detection. Here we show the results of image classification using ImageNet, object detection, and the graph structure of images constructed with ViG.

ImageNet

Isotropic ViG

Neural networks with an isotropic architecture keep the feature size unchanged throughout the main body, making them easy to scale up and friendly to hardware acceleration. This design is widely used in Transformer models for natural language processing, and recent neural networks for image processing have adopted it as well.

Table 3 compares isotropic ViG with existing isotropic CNNs, Transformers, and MLPs. The results show that ViG performs better than the other types of networks. For example, ViG-Ti achieves 73.9% top-1 accuracy, 1.7 percentage points higher than the DeiT-Ti model at a similar computational cost.

Table 3: Results for ViG and Isotropic networks in ImageNet

The marks indicate the network type: spades: CNN, squares: MLP, diamonds: Transformer, stars: GNN

Pyramid ViG

The Pyramid architecture progressively reduces the spatial size of the feature maps as the network deepens, which exploits the scale-invariant property of images and produces multi-scale features. Many advanced networks employ a pyramid architecture.

Table 4 compares Pyramid ViG with these representative pyramid networks. The Pyramid ViG series performs as well as or better than state-of-the-art pyramid networks, including CNNs, MLPs, and Transformers. This shows that graph neural networks perform well in visual tasks and have the potential to become a fundamental component of computer vision systems.

Table 4: Pyramid ViG and other pyramid networks results on ImageNet.

The marks indicate the network type: spades: CNN, squares: MLP, diamonds: Transformer, stars: GNN

Object Detection

The ViG model is applied to an object detection task to evaluate its generalizability. For a fair comparison, Pyramid ViG-S pre-trained on ImageNet is used as the backbone. The results in Table 5 show that, with both RetinaNet and Mask R-CNN, Pyramid ViG-S performs better than representative backbones of different types, such as ResNet, CycleMLP, and Swin Transformer. This result demonstrates the generalization capability of the ViG architecture.

Table 5: Results of object detection and instance segmentation in COCO val2017.

Visualization

To better understand the behavior of the ViG model, we have visualized the graph structure constructed in ViG-S. In the figure below, we represent the graphs of two samples at different depths. The pentagram is the central node, and the nodes of the same color are its neighbors.

In shallow layers, neighboring nodes tend to be selected based on low-level, local features such as color and texture. In deeper layers, the neighbors of the central node are more semantic and belong to the same category; the ViG network progressively links nodes by their content and semantic representation, helping to better recognize objects.

Summary

How was it? In this article, we introduced a model based on graph structures, which we feel are still not so familiar in computer vision. It turns out that graph structures can contribute to the development of computer vision from a new perspective, following in the footsteps of CNNs, Transformers, and MLPs.

Graph-based approaches can be applied to a variety of computer vision tasks. With even more research like this in the future, more accurate images and 3D objects may be produced, and the day may come when it will be impossible to distinguish between virtual space and reality.

tadano