Physics of Language Models || Vision Transformer for Biomedical Images, Part 2
The mathematical mechanism of Transformers behind "Attention Is All You Need", from a computer-vision perspective
During my investigation of Vision-Language models, I started by understanding how "Attention Is All You Need" works mathematically and by getting a better idea of how to implement it from scratch. You can follow my previous article, where I discussed (SAM) in depth.
In this article we are going to explain how to implement the Transformer architecture, with some modifications. The Transformer was designed to overcome most of the limitations of RNNs and LSTMs by looking back at previous tokens and finding the relations between features in the data representation, especially in natural language processing. In our case, we will see how to build a Transformer that learns from the image perspective.
First of all, let us explain the difference between applying Transformers in vision and in NLP. To make sense of that, we have to decide what type of task we are trying to solve, because the model design changes with the task. For example, GPT (Generative Pre-trained Transformer) uses only the decoder part of the Transformer and stacks decoder blocks on top of each other to provide more powerful feature extraction; keep in mind that stacking blocks increases the number of parameters of the model. Recent research from Meta AI (the LLaMA paper) shows that parameter count alone does not determine results: smaller LLaMA models trained on more data were competitive with, and on many benchmarks outperformed, much larger models such as GPT-3.
Now let us split the content to understand how to implement Transformers in vision:
1. Encoder-Decoder
2. Encoder is all we need
3. How to feed an image into Transformers
4. Build the final stage of classification (MLP)
5. Combine all the components to build the Vision Transformer
1. Encoder-Decoder
Model architectures such as the autoencoder or the variational autoencoder (VAE) are based on the encoding-decoding pattern; the reason behind it is that we try to map the data representation into a different space. The Transformer likewise uses an encoder-decoder architecture: the encoder extracts features from an input sentence, and the decoder uses those features to produce an output sentence (a translation).
Encoder-decoder architectures can handle inputs and outputs that both consist of variable-length sequences and are thus suitable for seq2seq problems such as machine translation. The encoder takes a variable-length sequence as input and transforms it into a state with a fixed shape; the decoder maps that fixed-shape encoded state back to a variable-length sequence.
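As a rough sketch of this pattern, PyTorch's built-in nn.Transformer wires an encoder and a decoder together. The sequence lengths, batch size, and model dimension below are arbitrary placeholders chosen for illustration, not values from any particular paper.

```python
import torch
import torch.nn as nn

# Encoder-decoder in one module: the encoder summarizes the (already embedded)
# source sequence, and the decoder attends to that summary while producing the target.
d_model = 512
model = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.randn(2, 10, d_model)   # source sequence of length 10 (already embedded)
tgt = torch.randn(2, 7, d_model)    # target sequence of length 7 (already embedded)

out = model(src, tgt)
print(out.shape)                    # torch.Size([2, 7, 512]) -- follows the target length
```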
2. Encoder is all we need
In language modeling, some models such as BERT (Bidirectional Encoder Representations from Transformers) use only the encoder part to learn the features, without including a decoder, and add the intuitive idea of a [CLS] token.
In order to better understand the role of [CLS], let's recall that the BERT model has been trained on two main tasks:
- Masked language modeling: some random words are masked with the [MASK] token, and the model learns to predict those words during training. For that task we need the [MASK] token.
- Next sentence prediction: given two sentences, the model learns to predict whether the second sentence really follows the first one. For this task we need another token whose output tells us how likely it is that the second sentence follows the first. This is where [CLS] comes in: you can think of the output of [CLS] as that probability.
This [CLS] token is added to the input embedding matrix, is learnable, and also receives positional encoding. Moreover, the encoder is stacked N times to learn more contextual relations between the input tokens; this is where the term Transformer encoder comes from.
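To make the "stack the encoder N times" idea concrete, here is a minimal sketch using PyTorch's nn.TransformerEncoder. The depth, number of heads, and embedding size are placeholder values, not the exact BERT or ViT configuration.

```python
import torch
import torch.nn as nn

# An encoder-only stack (the BERT / ViT pattern): N identical encoder blocks
# applied one after another on a sequence of token embeddings.
encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12,
                                           dim_feedforward=3072, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)   # N = 12 stacked blocks

tokens = torch.randn(2, 197, 768)   # e.g. 196 patch/word tokens + 1 [CLS] token
features = encoder(tokens)
print(features.shape)               # torch.Size([2, 197, 768])
print(features[:, 0].shape)         # the [CLS] output used for classification: (2, 768)
```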
Now that we have explained why we only need the encoder part, the rest of the Transformer architecture stays the same. We only need an idea of how to redesign the model at certain stages, such as the input and output layers, because the Transformer was created for NLP tasks; this means we need to make the model able to handle images and perform tasks such as classification or segmentation, depending on the modeling framework we are dealing with. In our case we will use a Transformer encoder to do a classification task. A general overview of the Vision Transformer architecture is shown in the figure below.
3. How to feed an image into Transformers
The major change needed to make a Transformer able to handle images is the patch embedding, which includes a positional encoding (this is where the term positional embedding comes from). In this section we will dive deeply into the first layer, the so-called linear projection, which splits the image into a set of patches; each patch is treated by the model as a token. The reason behind this is that the attention mechanism is designed to learn from sequence-to-sequence tasks, where it captures the contextual representation of word vectors and learns the relevant relations between words, so we need to make images observable to the model in the same way (a toy sketch of this splitting step follows below).
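As a toy illustration of the patch-as-token idea (before any learned projection), the snippet below simply reshapes a 224x224 RGB image into 196 flattened 16x16 patches. The image size and patch size are just the common ViT defaults, assumed here for illustration.

```python
import torch

img = torch.randn(1, 3, 224, 224)   # (batch, channels, height, width)
patch_size = 16

# Cut the height and width into non-overlapping 16x16 windows...
patches = img.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
# ...then flatten each window into one vector: 14 x 14 = 196 patch "tokens".
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)
print(patches.shape)                # torch.Size([1, 196, 768])
```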
3.1 Linear Projection
Patch embedding is the process used in the Vision Transformer (ViT) to transform the image into a sequence of fixed-length embeddings that can be processed by the Transformer architecture. The input image is divided into a grid of non-overlapping patches, which are then linearly embedded into a lower-dimensional space. These embeddings are concatenated into a sequence, and a learnable positional embedding is added to each of these vectors (tokens). The positional embedding allows the network to know where each sub-image was originally positioned in the image; without this information, the network would not know where each patch belongs, leading to potentially wrong predictions.
The resulting sequence of embeddings is then fed as input to the Transformer encoder. This method allows the model to process the image as a sequence, enabling it to capture local and global features of the image simultaneously. The size of the patches and the dimension of the embeddings can be tuned to optimize the performance of the model for a given task.
Code example:
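The full implementation lives in the repo linked in the references; the snippet below is only a minimal sketch of a patch-embedding layer, using the common trick of a strided convolution whose kernel and stride equal the patch size (equivalent to flattening each patch and applying one shared linear layer). The class name and default sizes are illustrative, not the repo's exact code.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches and linearly project each patch to an embedding."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # kernel_size == stride == patch_size: each patch is projected independently.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                  # (B, embed_dim, H/patch, W/patch)
        x = x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)
        return x

x = torch.randn(4, 3, 224, 224)
print(PatchEmbedding()(x).shape)          # torch.Size([4, 196, 768])
```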
When we add the positional embedding vector to each patch, the model knows where each patch is located in the spatial dimensions of the image.
Code example:
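A minimal sketch of this step, assuming the patch tokens produced by the previous snippet: a learnable [CLS] vector is prepended and a learnable positional embedding is added to every token. The class and parameter names are hypothetical, chosen for illustration.

```python
import torch
import torch.nn as nn

class ViTEmbedding(nn.Module):
    """Prepend a [CLS] token and add a learnable positional embedding to patch tokens."""
    def __init__(self, num_patches=196, embed_dim=768):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim) * 0.02)

    def forward(self, patch_tokens):              # (B, num_patches, embed_dim)
        b = patch_tokens.size(0)
        cls = self.cls_token.expand(b, -1, -1)     # one shared [CLS] per sample
        x = torch.cat([cls, patch_tokens], dim=1)  # (B, num_patches + 1, embed_dim)
        return x + self.pos_embedding              # inject where each patch came from

tokens = torch.randn(4, 196, 768)
print(ViTEmbedding()(tokens).shape)                # torch.Size([4, 197, 768])
```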
4. Build the final stage of classification (MLP)
In most cases, when we deal with a classification problem, we use an MLP (Multi-Layer Perceptron): a linear layer that maps from the patch embedding dimension to the number of classes. Only one token is given to it, the one extracted from the sequence at the position corresponding to [CLS].
Code example:
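A minimal sketch of the classification head, assuming the encoder output shape from the previous snippets: only the [CLS] position is kept and passed through a layer norm and a linear layer. The number of classes here is a placeholder.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Map the [CLS] token of the encoder output to class logits."""
    def __init__(self, embed_dim=768, num_classes=2):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, encoder_output):         # (B, num_tokens, embed_dim)
        cls_token = encoder_output[:, 0]        # keep only the [CLS] position
        return self.fc(self.norm(cls_token))    # (B, num_classes)

features = torch.randn(4, 197, 768)
print(ClassificationHead(num_classes=2)(features).shape)   # torch.Size([4, 2])
```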
5. Combine all the components to build the Vision Transformer
Now that we have explained the difference between the use cases of Transformers in NLP and in vision, we know there is no big difference: only a few additional blocks are added to make the model able to process images as sequences of tokens, by splitting the images into patches, with each patch processed as a token in the patch embedding matrix.
Implementation
To demonstrate how to use the Vision Transformer, I wrote the full code, which is available in the GitHub repo linked in the references; here I will show the building blocks of the Transformer encoder and the results.
Code:
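The repo contains the full PyTorch Lightning version; what follows is only a condensed sketch that wires the pieces from the previous sections together, with nn.TransformerEncoder standing in for hand-written encoder blocks. Hyperparameters and class names are illustrative assumptions, not the repo's exact configuration.

```python
import torch
import torch.nn as nn

class VisionTransformer(nn.Module):
    """Patch embedding -> [CLS] + positions -> stacked encoder blocks -> MLP head."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3,
                 embed_dim=768, depth=12, num_heads=12, num_classes=2):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(in_channels, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Sequential(nn.LayerNorm(embed_dim),
                                  nn.Linear(embed_dim, num_classes))

    def forward(self, x):
        x = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed      # prepend [CLS], add positions
        x = self.encoder(x)                                  # stacked self-attention blocks
        return self.head(x[:, 0])                            # classify from the [CLS] token

model = VisionTransformer(num_classes=2)
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)                                          # torch.Size([2, 2])
```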
Results:
The data used in this experiment is a coronary artery disease dataset collected from a university lab. Please visit the repo for a full explanation of how to run the code and to see the results.
Summary
After understanding how to build a Vision Transformer, we need to keep in mind the limitations of this type of model, which was created and designed specifically for NLP tasks. For vision tasks, this architecture requires a lot of data to train on: the model is data-hungry, and without a large number of samples it will not be able to learn in depth and extract the features. If you have limited data, it is better to stick with a ResNet or another convolutional network. Otherwise, this article shows how to implement a Transformer in vision, which leads to more general tasks that bring language and vision together. From here we can say that large language models are powerful AI systems that can learn and describe language context in depth, discover relevant descriptions of a task, and solve problems; the same goes for vision-language models, which we will discuss in the next article.
References:
- Vision Transformer paper: https://arxiv.org/abs/2010.11929
- Transformers chapter from the book "Dive into Deep Learning" (D2L), covering the mathematical concepts and implementation details of Transformers; a valuable resource for understanding the theory behind them: https://d2l.ai/chapter_attention-mechanisms-and-transformers/transformer.html
- Repository with the full in-depth Vision Transformer implementation using PyTorch Lightning: https://github.com/deep-matter/VisionTransformer-Coronary-Artery
- LLaMA paper: https://arxiv.org/abs/2302.13971
- Patch embedding and positional encoding: https://gowrishankar.info/blog/transformers-everywhere-patch-encoding-technique-for-vision-transformersvit-explained/