A Survey on the Application of Transformers in Computer Vision

In the past few years, convolutional neural networks have been the mainstream architecture for processing images. The Transformer, a brand-new deep neural network first proposed in 2017 and based mainly on the self-attention mechanism, has achieved remarkable results in natural language processing. Compared with traditional convolutional and recurrent networks, the model is superior in quality, offers stronger parallelism, and requires less training time. Because of these advantages, more and more researchers are exploring how the Transformer can be applied to computer vision. This article aims to provide a comprehensive overview of such applications. We first introduce the self-attention mechanism, the core component of the Transformer, covering single-head attention, multi-head attention, positional encoding, and related topics, and then introduce the Reformer, an improved variant of the Transformer. We then survey applications of the Transformer in computer vision: image classification, object detection, and image processing. Finally, we discuss future research directions for the Transformer in computer vision, in the hope of arousing further interest in this model.


Introduction
The Transformer is a model architecture proposed in the 2017 paper "Attention is All You Need" (1). In that paper, experiments were conducted only on machine translation, where the model decisively beat the SOTA models of the time: on the WMT 2014 English-to-German translation task it achieved 28.4 BLEU, more than 2 BLEU higher than the best existing result. And because the encoder side is computed in parallel, training time is greatly shortened.
Its pioneering design overturned the previous assumption that sequence modeling meant RNNs, and it is now widely used across NLP. The language models currently flourishing in NLP applications, such as GPT (2), BERT (3), RoBERTa (4), and XLNet (5), are all based on the Transformer. Therefore, it is particularly important to clarify every detail of the Transformer model.
Machine translation entered the neural era with RNNs (6). The following briefly contrasts the RNN and the Transformer. The Transformer is a sequence-to-sequence model; its special feature is its heavy use of self-attention. For end-to-end models (7), the most commonly used architecture had been the RNN: its input is a sequence of vectors, and its output is another sequence of vectors, as shown in Figure 1. RNNs are very good at handling sequential input. The problem is that RNNs are hard to parallelize, which is not conducive to large-scale, fast training and deployment. The Transformer abandons the traditional CNN and RNN and relies entirely on the attention mechanism, which is a major improvement. Its inputs and outputs are the same as an RNN's, and it is end-to-end; every output has seen the entire input, yet, remarkably, every position can be computed in parallel.
More and more computer-vision researchers are paying attention to the Transformer, which has led to a rapid increase in the number of Transformer-based models. According to our survey, about 800 papers on arXiv matched the keyword "transformer" in 2021. Most existing literature reviews on the Transformer focus on NLP (26, 27) or on generic attention-based methods (28). This article instead aims to provide a comprehensive overview of Transformer model development for computer vision. Taking the emerging field of visual Transformers as its subject, we first introduce the key concepts of the Transformer network and the self-attention mechanism, then introduce the Reformer, an improved variant of the Transformer, and finally survey recent visual Transformers in detail, which is the main content of this article. At the end of this study, several open problems are discussed and prospects for possible future work are given.

Foundations of Transformer
As shown in Figure 2, the overall structure of the Transformer comprises six encoders and six decoders with the same structure (29). The design of each encoder and decoder is the same: each has two sub-layers, self-attention and a feed-forward neural network, and each decoder additionally has an encoder-decoder attention layer. For convenience, all sub-layers and embedding layers in the model produce outputs of dimension d_model = 512. So how does self-attention work?

Positional Encoding
First of all, the input tokens are embedded into 512-dimensional vectors, but it is not enough to feed the word embeddings directly into the encoder: the Transformer has no recurrent structure like an RNN and therefore no inherent ability to capture order, so no matter how the sentence structure is permuted, it would produce similar results (29). To solve this problem, a positional encoding vector is introduced to represent the distance between two words; in short, the positional encoding is added to the word embedding. Sine and cosine functions are used, which makes it easy for the model to learn relative positions. Specifically, the following rules are used for encoding:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

where pos represents the position of the word in the sentence, and i indexes the dimension of the positional encoding.
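As a concrete illustration, here is a minimal NumPy sketch of the sinusoidal positional encoding above; the function name and the toy sequence length of 50 are our own choices:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]           # (max_len, 1) positions
    i = np.arange(0, d_model, 2)                # even dimension indices 2i
    angle = pos / np.power(10000.0, i / d_model)
    pe[:, 0::2] = np.sin(angle)                 # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)                 # PE(pos, 2i+1)
    return pe

pe = positional_encoding(50, 512)               # one row per position
```

Each row is simply added to the corresponding word embedding before the encoder.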

Scaled Dot-Product Attention
In the self-attention layer, suppose the input is a sequence. Each input token is multiplied by an embedding matrix to obtain an embedding of dimension 512, which then enters the self-attention layer. There, each vector is multiplied by three different learned transformation matrices to produce a query Q, a key K, and a value V. Next, attention scores between Q and K are computed and divided by the square root of the key dimension, which keeps the gradients more stable. The result is passed through a softmax, which determines how much each word contributes to the encoding of the current position, and is finally multiplied by V to obtain the weighted value matrix.

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

This is exactly the attention computation described earlier: the encoded output of each input word incorporates, via the attention mechanism, the encoding information of all the other words.
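The whole computation can be sketched in a few lines of NumPy; the shapes below (3 queries, 5 keys/values, toy dimensions) are illustrative choices, not values from the paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # stable softmax
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # query-key similarities, scaled
    weights = softmax(scores)        # each row sums to 1
    return weights @ V, weights     # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))      # 3 queries
K = rng.standard_normal((5, 8))      # 5 keys
V = rng.standard_normal((5, 16))     # 5 values
out, weights = scaled_dot_product_attention(Q, K, V)
```

Each output row is a convex combination of the value vectors, with the softmax row giving the contribution of every position.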

Multi-Head Attention
Although single-head attention works well, in practice we usually want to attend to information at multiple positions. Although each position is more or less reflected in the encoding, it may be dominated by the word itself, crowding out attention to other equally important positions. On top of single-head attention, a mechanism called "multi-head" attention was therefore proposed to further improve the self-attention layer and the performance of the attention layer. The specific mechanism is as follows: it gives the attention layer multiple "representation subspaces". With multi-head attention there are multiple sets of query/key/value weight matrices (the Transformer typically uses 8 attention heads, so each encoder/decoder has eight such sets).
From Figure 3, we can see that multi-head attention contains multiple self-attention layers. With a total of 8 heads, we obtain 8 output matrices, which are concatenated and multiplied by an additional output weight matrix to produce the final result.
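A minimal NumPy sketch of this multi-head scheme follows; the small random weights and the helper name `multi_head_attention` are our own illustrative choices, not trained parameters:

```python
import numpy as np

def multi_head_attention(X, W_qkv, W_o):
    """W_qkv: one (Wq, Wk, Wv) triple per head; W_o: output projection."""
    heads = []
    for Wq, Wk, Wv in W_qkv:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        A = e / e.sum(axis=-1, keepdims=True)
        heads.append(A @ V)                       # one head's output
    return np.concatenate(heads, axis=-1) @ W_o   # concat heads, then project

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_model // h                                # 64 dims per head
W_qkv = [tuple(rng.standard_normal((d_model, d_k)) * 0.02 for _ in range(3))
         for _ in range(h)]
W_o = rng.standard_normal((h * d_k, d_model)) * 0.02
X = rng.standard_normal((10, d_model))            # 10 tokens
out = multi_head_attention(X, W_qkv, W_o)
```

Each head attends in its own 64-dimensional subspace; the final projection mixes the heads back to d_model dimensions.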

Specific process of encoding and decoding
The encoder maps the input sequence, after positional encoding, to a hidden representation, and the decoder then maps that hidden representation to the output sequence, as shown in Figure 4. The encoder consists of four parts. First, the input data are converted into vectors and, after positional encoding, fed to the multi-head attention sub-layer. Compared with the sequential RNN input mentioned above, the Transformer does not need to consume the data one item at a time: it takes the whole sequence in parallel while preserving the positional relationships between items, which greatly improves computation speed and reduces storage.

Fig. 3. Multi-head attention. The image is from (1).

Second, the purpose of multi-head attention is to capture correlations within the data, which also compensates for the lack of such correlation modeling in CNN methods. The third part is the residual connection and normalization. As the number of network layers increases, the mapping a model learns from accumulated residual errors becomes less and less precise; therefore the multi-head attention sub-layer is followed by an Add & Norm layer, where Add denotes the residual connection that prevents network degradation, and Norm denotes layer normalization of each layer's activations. Finally, a feed-forward layer (FFN) composed of two fully connected layers introduces nonlinearity (the ReLU activation), transforming the space of the attention output and increasing the capacity of the model. In this way, the representations learned by the encoder are more accurate and representative.
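The Add & Norm and FFN steps described above can be sketched as follows; the random linear maps standing in for the trained attention and FFN sub-layers, and all dimensions, are toy assumptions:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize the activations of each position to zero mean, unit std."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_layer(x, attn, ffn):
    x = layer_norm(x + attn(x))   # Add & Norm around the attention sub-layer
    x = layer_norm(x + ffn(x))    # Add & Norm around the feed-forward sub-layer
    return x

# toy stand-ins for the real sub-layers
rng = np.random.default_rng(0)
W1 = rng.standard_normal((64, 256)) * 0.1
W2 = rng.standard_normal((256, 64)) * 0.1
ffn = lambda x: np.maximum(x @ W1, 0) @ W2    # two dense layers with ReLU
Wa = rng.standard_normal((64, 64)) * 0.1
attn = lambda x: x @ Wa                       # placeholder "attention"

x = rng.standard_normal((5, 64))              # 5 tokens, 64 dims
y = encoder_layer(x, attn, ffn)
```

The residual additions keep a direct path from input to output, which is what prevents degradation as layers are stacked.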

Reformer
The Transformer is characterized by its multi-head attention mechanism, feed-forward network, positional encoding, and residual connections. These components let every element interact with global information regardless of distance, and give the model good scalability. It breaks through the limitation that RNN models cannot be parallelized: parallel computation accelerates training, and self-attention yields a more interpretable model, since we can inspect the attention distributions and individual attention heads can learn to perform different tasks. It has been widely used for contextual semantics in natural language processing (30).
Although the Transformer produces very good results and is applied to ever longer sequences, such as 11K-token texts, many of these large models can only be trained on large industrial computing platforms and cannot run on a single GPU, because their memory requirements are too large. For example, the full GPT-2 model (31) contains about 1.5B parameters; the largest layers exceed 0.5B parameters each, and models can reach 64 layers. To address these problems in the Transformer, Kitaev et al. optimized parts of the structure and proposed a new model, the Reformer (32).

LSH Attention
To solve the problem of the Transformer's excessive memory requirements, the Reformer improves the attention module, replacing scaled dot-product attention with locality-sensitive hashing (LSH) attention. The Transformer's multi-head attention is computed in parallel and stacked. If the sequence length in an experiment is L, the product QK^T forms an L x L matrix: an attention score is computed between every pair of positions, and multiple self-attention layers are stacked, which results in a large amount of computation and memory. In the Reformer, LSH attention is chosen to replace multi-head attention. The basic idea of locality-sensitive hashing is shown in Figure 5.
LSH (locality-sensitive hashing) is a similarity-search algorithm for massive data. A special hash function is designed so that two items with high similarity are mapped to the same hash value with high probability; such a function is called an LSH. Its most fundamental use is to handle nearest-neighbor queries over massive high-dimensional data efficiently. As shown in Figure 5, items with similar attention scores, such as q, k1, and k2, are mapped into the same bucket, while weakly correlated items are mapped into other buckets. After this mapping, the data nearest and most similar to q, such as k1, fall into the same bucket, and other data are likewise grouped by attention score, which facilitates searching and reduces the amount of calculation. The computational complexity is therefore greatly reduced, and the efficiency of long-sequence training is improved (33).
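One common way to build such a hash is random hyperplane projection; the following toy NumPy sketch (our own construction for illustration, not the exact angular LSH scheme the Reformer uses) shows the bucketing idea with a query q, a similar key k1, and a dissimilar key k2:

```python
import numpy as np

def lsh_hash(vectors, n_planes=4, seed=0):
    """Hash vectors by the signs of random hyperplane projections:
    nearby vectors land in the same bucket with high probability."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((vectors.shape[1], n_planes))
    bits = (vectors @ planes) > 0                  # one sign bit per plane
    return [int("".join("1" if b else "0" for b in row), 2) for row in bits]

q  = np.array([1.0, 0.0])
k1 = np.array([0.99, 0.1]);  k1 /= np.linalg.norm(k1)   # close to q
k2 = np.array([-1.0, 0.05]); k2 /= np.linalg.norm(k2)   # nearly opposite q
buckets = lsh_hash(np.stack([q, k1, k2]))
```

Attention then only needs to be computed inside each bucket, which is what shrinks the L x L cost.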

Reversible residual network
Although LSH attention solves the problem of excessive computation, back-propagation during training still dominates memory: a single layer of the network can require several GB, and because the Transformer is a multi-layer model trained by gradient descent, the activations of every layer are normally retained for use by the next. A typical Transformer needs at least 12 layers of FFN and multi-head attention, so caching the activations of every layer requires far more memory than is available. The Reformer therefore improves on the Transformer by replacing the standard residual layers with reversible residual layers (34) to reduce memory consumption: in back-propagation, only the activations of the last layer need to be stored, because the activations of any intermediate layer can be recovered from those of the last layer.

Fig. 5. The basic idea of locality-sensitive hashing.

The basic diagram of the reversible residual layer is shown in Figure 6 (34). On the left of the figure is the standard residual layer: the activations of each layer must be kept for the next layer's update, so more memory is needed. On the right is the reversible residual layer, which splits the input into two parts, x1 and x2, processed in two channels. In the first channel, x1 is kept and the attention output of x2 is added to it, giving the first output, y1 = x1 + Attention(x2). In the second channel, x2 is kept and the FFN output of y1 is added to it, giving the second output, y2 = x2 + FFN(y1). Conversely, once y1 and y2 are known, we can recover x2 = y2 - FFN(y1), and similarly x1 = y1 - Attention(x2). Knowing only the activations of the top layer, we can therefore reconstruct all the activations of the lower layers. In this way, training is no longer limited by the memory consumed by stored activations.
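The reversible block's forward and inverse passes can be sketched directly in NumPy; here np.tanh and ReLU are arbitrary stand-ins for the attention and FFN sub-layers:

```python
import numpy as np

def rev_forward(x1, x2, f, g):
    y1 = x1 + f(x2)        # first channel: x1 plus f(x2)
    y2 = x2 + g(y1)        # second channel: x2 plus g(y1)
    return y1, y2

def rev_backward(y1, y2, f, g):
    x2 = y2 - g(y1)        # invert the second channel first
    x1 = y1 - f(x2)        # then invert the first channel
    return x1, x2

f = np.tanh                          # stand-in for the attention sub-layer
g = lambda z: np.maximum(z, 0.0)     # stand-in for the FFN sub-layer

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
y1, y2 = rev_forward(x1, x2, f, g)
r1, r2 = rev_backward(y1, y2, f, g)  # exact recovery of x1, x2
```

Because the inverse recomputes f and g with the same inputs, recovery is exact, so no intermediate activations need to be cached.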

Visual transformer
Having understood the standard Transformer for NLP, many vision researchers find it straightforward to apply the Transformer to visual tasks. In this chapter we therefore give an overview of several Transformers designed for vision, grouped into different research directions.

Transformers for Image Classification
Unlike traditional image classification work, ViT does not use a conventional CNN; instead, the NLP Transformer is migrated to the CV domain with as few modifications as possible. Experiments show that on large-scale training sets, a Transformer alone performs very well on classification, and when a model pre-trained on a large-scale dataset is transferred to classification on smaller datasets, performance is also very good. Figure 7 shows the architecture of the ViT model (11).
Whereas the language data processed in NLP are serialized, the image data processed in practice are three-dimensional (height, width, and channels). We therefore need to preprocess the image, splitting it into blocks and reducing its dimensionality, because the Transformer expects a two-dimensional input (N, D), where N is the sequence length and D is the dimension of each vector in the sequence. The 3D data are turned into serialized data by this blocking method.
An image of size H x W x C is split into N = HW/P^2 patches, where P x P is the resolution of each patch after blocking. To convert these into the 2D input, one more step, patch embedding, is needed: a linear transformation (i.e. a fully connected layer) maps each flattened patch vector to dimension D. Combined with the position embedding, the input is

z0 = [x_class; x_p^1 E; x_p^2 E; ...; x_p^N E] + E_pos    (4)

where E is the patch embedding matrix and E_pos the position embedding. Because ViT uses only the Transformer encoder, for the classification task it would be unclear which output vector to classify from if, say, nine inputs produce nine output encodings. Therefore a learnable embedding vector, the class token, is prepended; its corresponding output is used to determine the image category for the other input vectors.
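The patch-splitting and embedding steps can be sketched as follows; the image size (32x32x3), patch size P=8, embedding dimension D=64, and the zero-initialized class token and position embeddings are all illustrative assumptions:

```python
import numpy as np

def patchify(img, P):
    """Split an (H, W, C) image into N = HW/P^2 flattened P*P*C patches."""
    H, W, C = img.shape
    patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)           # (N, P*P*C)

def embed(img, P, E, cls_token, pos):
    x = patchify(img, P) @ E                        # linear patch embedding
    x = np.concatenate([cls_token, x], axis=0)      # prepend [class] token
    return x + pos                                  # add position embeddings

img = np.random.rand(32, 32, 3)
P, D = 8, 64
N = (32 // P) * (32 // P)                           # 16 patches
E = np.random.rand(P * P * 3, D)                    # patch embedding matrix
out = embed(img, P, E, np.zeros((1, D)), np.zeros((N + 1, D)))
```

The resulting (N+1) x D matrix is exactly the 2D sequence the Transformer encoder expects.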

Transformers for Object Detection
The task of object detection is to predict the categories and bounding-box coordinates of a set of objects in a given image. Most existing detectors cast this set-prediction problem as prediction over many candidates using proposals, anchors, or windows. Addressing this problem (36), Carion et al. presented the DETR model (35), which adapts the Transformer and applies it to object detection for the first time. Given an input image, features are first extracted with a CNN, then flattened into a feature sequence and fed into the Transformer model, which outputs an unordered set of fixed size N, each element containing an object's category and coordinates. As the first end-to-end approach of this kind, it achieved good detection results. Figure 8 shows the overall framework of DETR.
First, DETR uses a CNN backbone to extract features from the input image. Fixed positional encodings are added to the flattened features to supplement the position information, and the result is sent to the encoder-decoder Transformer. The Transformer decoder takes the embeddings from the encoder together with N learned positional encodings and generates N output embeddings. The final predictions, box coordinates and class labels indicating whether each output is a specific class of object or not, are computed from these by a simple feed-forward network (FFN).
Two parts are key. The first is the design: the encoder-decoder architecture of the Transformer generates all N box predictions at once, where N is a preset integer much larger than the typical number of objects in an image. The second is the bipartite matching loss: the loss is computed over a bipartite matching between predicted boxes and ground-truth boxes, driving the position and category of each matched prediction closer to its ground truth.
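To illustrate the matching step, here is a tiny brute-force sketch. DETR itself uses the Hungarian algorithm; exhaustive search over permutations, shown here only because it fits in a few lines, is viable solely for this toy, made-up cost matrix:

```python
from itertools import permutations

def best_matching(cost):
    """Find the one-to-one assignment of predictions (rows) to ground-truth
    boxes (columns) with minimal total cost, by exhaustive search."""
    n = len(cost)
    best, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best:
            best, best_perm = total, perm
    return best_perm, best

# rows: predicted boxes, cols: ground-truth boxes (hypothetical costs)
cost = [[0.1, 0.9, 0.8],
        [0.7, 0.2, 0.9],
        [0.8, 0.8, 0.3]]
match, total = best_matching(cost)   # match is (0, 1, 2), total is 0.6
```

In practice the cost combines classification probability and box distance, and the loss is then computed only between matched pairs.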

Transformers for Image Processing
IPT (17) is the first pre-trained model that uses the powerful representation ability of the Transformer for low-level vision applications such as denoising, deraining, and super-resolution. However, the datasets available for these specific tasks are limited, and lighting, weather, camera, and other factors all affect data quality. Moreover, to serve multiple tasks, we do not know which type of image-processing task will be requested before the test image arrives, so different heads and tails must be matched to complete the corresponding low-level vision task. Since IPT targets three tasks, it needs three heads, three tails, and one shared body.
As shown in Figure 9, IPT consists of multiple heads, an encoder, a decoder, and multiple tails. For different image-processing tasks, multi-head and multi-tail structures and task embeddings are introduced. A head extracts features from the input degraded image (such as a noisy or low-resolution image); these features are split into patches and fed into the encoder-decoder architecture, which recovers the information lost in the input data; the output is then reshaped into features of the same size. IPT is pre-trained on a processed ImageNet dataset and greatly improves performance on image-processing tasks; after fine-tuning, the pre-trained model can be used effectively for the required task.

Conclusions
As a powerful deep neural network, the Transformer was first used to process sequences in natural language and, alongside the CNN, has since become widely used in computer vision. This paper summarizes the self-attention mechanism and focuses on the Transformer structure built from it, and also introduces the efficient Reformer model that improves upon the Transformer. It then analyzes several important applications of the attention mechanism in deep-learning algorithms for CV, such as image classification, object detection, and image processing; these attention-based algorithms perform well and are of high research significance, and we hope this survey promotes the further development of the Transformer.

Fig. 9. The architecture of IPT. The image is from (36).

Future Prospects
To promote the development of visual Transformers, we offer several potential directions for future research.
(1) Some work has already been done to simplify the structure of the Transformer, but practical deployment on hardware remains very difficult, so lightweight Transformers are needed in the future.
(2) Most of the datasets used with Transformers are large public datasets, but the data available in practice are often small. In the future, we need to study how to apply the Transformer with small amounts of data.
Based on the attention mechanism and continuously improved against its own shortcomings, the Transformer will be applied to more fields to explore more possibilities.