Attention is All You Need

Language is made of sequences of words (tokens). When a token turns around and reciprocates the attention of a previous token – magic happens!! The success of language models comes from modeling this two-way, or bidirectional, attention.
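
To make that "reciprocated attention" a little more concrete, here is a minimal sketch of scaled dot-product self-attention in plain NumPy. It is a toy illustration with made-up embeddings, not the paper's full multi-head attention and not code from this post: the point is simply that the attention weight matrix is full, so a later token can attend back to an earlier one.

```python
import numpy as np

def self_attention(X):
    """Toy self-attention: X is (num_tokens, dim); queries, keys and values are all X here."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                              # pairwise token-to-token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights, weights @ X                                 # each token becomes a mix of all tokens

# Three made-up token embeddings. The weight matrix is full (bidirectional):
# token 0 attends to token 2, and token 2 attends back to token 0.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
weights, output = self_attention(X)
print(weights)
print(output)
```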

Now the question comes … how does that magic happen under the hood? Well, the answer is ‘Transformers’ – a dozen or so layers of neural networks stacked one over the other, which have seen millions of sequences of words from actual English books. Now here is the thing: take 4 English words and throw them in the air – mathematically they can be arranged, or permuted, in 4! = 24 ways. Out of those, only a few sequences make sense, and those are exactly the ones present in actual English text. That real text serves as ground truth, giving the Transformer the basis to learn sensible sequences of words, i.e. the ability to predict the next word, then the next and the next – and thus the whole language. Eventually this knowledge is stored as numbers inside a huge embedding matrix.
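
The counting claim is easy to check in code. Here is a tiny sketch – my own toy example, with words I picked for illustration – that lists all 24 orderings of four words and flags the one that actually shows up in English text:

```python
from itertools import permutations

words = ["the", "cat", "sat", "down"]
orderings = list(permutations(words))
print(len(orderings))                        # 4! = 24 possible arrangements

sensible = {("the", "cat", "sat", "down")}   # ordering(s) you would actually find in a book
for order in orderings:
    marker = "  <- appears in real text" if order in sensible else ""
    print(" ".join(order) + marker)
```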

“Dilwale Dulhania Le Jayenge”, https://en.wikipedia.org/wiki/Dilwale_Dulhania_Le_Jayenge; 20 October 1995; Yash Raj Films; Yash Chopra

“Attention Is All You Need”, https://arxiv.org/abs/1706.03762; 12 June 2017; arXiv (Cornell University); Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
