Understanding Transformers and Attention Mechanisms
Do I have your attention? Well, what does it really mean to have your attention? Attention is defined as “notice taken of someone or something; the regarding of someone or something as interesting or important.” As humans, we are able to selectively concentrate on aspects of - let’s say - an image, and efficiently understand context while simultaneously drawing conclusions. Rather than looking over every pixel/detail of an image, the human brain chooses to focus on key objects first.
This human trait of “attention” is mimicked in computer vision tasks. Rather than scanning every part of an image with equal weight, attention mechanisms let a model focus on the most relevant areas.
Above, attention is focused on the highlighted areas to output a certain word. “A woman is throwing a frisbee in the park.”
This concept of attention can be applied to other fields of machine learning as well. Think of an elementary school science textbook. If someone were to ask you “What are the three main classifications of rocks?”, to find the answer you would refer to the chapter about Rocks and not read the entire textbook cover to cover. This allows you to find a specific answer rather than generalizing based on the entire book you just read. This same logic is how attention mechanisms work within natural language processing tasks.
In 2017, Ashish Vaswani and others on the Google Brain team released a new method of language processing called the “Transformer network” in a paper titled “Attention is All You Need.” The Transformer utilizes these attention mechanisms (hence the name “Attention is All You Need”) to process long sequences of data, i.e. long strings of text, faster and more accurately than before.
Rather than approaching language through the typical lens of processing each word one by one, Google’s team drew inspiration from attention mechanisms in Convolutional Neural Networks (CNNs) and viewed bodies of text almost as if they were images themselves. “Attention is All You Need” used the key ideas of attention in CNNs to teach machines how to read, write, and understand human language (Natural Language Processing, or NLP) more efficiently and accurately than before. The Transformer was designed to be a better computational tool for manipulating, interpreting, and generating language.
The Problems Addressed by the Attention Mechanisms
Long Term Dependency Problems of Previous Solution (the RNN)
A Recurrent Neural Network (RNN) is the structure that was originally used to capture temporal dependencies in sequences of text by processing each word one at a time, in order. For tasks such as translation, RNNs are typically arranged in an encoder/decoder structure. Think of the encoder and decoder as two speakers who each have their own language plus one language they share. The encoder takes in the text and translates a summary of it into the shared language, and then the decoder translates that summary into its own language.
The issue with this is that if the “summary” is bad, then the translation will be bad. RNNs have a “long term dependency problem”: the longer the text is (recall the textbook example), the worse the summary will be. Since RNNs process one word at a time, they also have a hard time remembering key information from earlier in the passage. A major cause is the vanishing gradient problem: during training, the learning signal that flows backward through the sequence shrinks at every step, so the network learns very little from words that appeared far back. For example, suppose a passage mentions at the beginning that a man is from America and later contains the sentence: He speaks _______. An RNN has a hard time calling upon that earlier information to fill in the blank because it has no way to identify what is important to remember, whereas a human would know the answer is most likely “English.”
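A rough way to see why the vanishing gradient hurts long passages is to multiply a learning signal by the same shrinking factor once per word. The Python sketch below does only that; the factor 0.9 and the function name gradient_scale are illustrative assumptions, not values taken from a real RNN.

```python
def gradient_scale(per_step_factor: float, num_steps: int) -> float:
    """Scale applied to a learning signal after flowing back num_steps words.

    per_step_factor is an assumed, constant shrink factor per word;
    a real RNN's factor varies from step to step.
    """
    scale = 1.0
    for _ in range(num_steps):
        scale *= per_step_factor
    return scale


# The further back a word is, the weaker the signal that reaches it.
for steps in (1, 10, 50, 100):
    print(f"{steps:>3} words back: signal scaled by {gradient_scale(0.9, steps):.6f}")
```

Under this assumption, a word 100 steps back receives only a tiny fraction of the signal that a word 1 step back does, which is why “America” at the start of a long passage is hard for an RNN to use later.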
How Transformers Attempt to Solve the Long Term Dependency Problem with Attention Mechanisms
The breakthrough that propels Transformers ahead of previous methods of NLP is the use of these attention mechanisms. Recall from earlier that attention mechanisms focus on keywords within a body of text rather than looking at all of the words with equal weight. So, in the instance of “the man was from America; what language does he speak?”, an attention mechanism would take note of the word “America” and use that as context to figure out that he speaks English. Because attention can connect any two words directly, no matter how far apart they are, it helps fight the “vanishing gradient problem” that RNNs suffered from. Though Transformers are a clear improvement over RNNs, research suggests that they still don’t handle long term dependencies that well; there is clearly more room for improvement.
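The weighting idea can be sketched in a few lines of Python. This is a minimal, pure-Python version of scaled dot-product attention for a single query, the basic operation described in “Attention is All You Need”; the learned projection matrices and multiple heads of a real Transformer are left out, and the word vectors below are made-up toy numbers chosen only so that “America” stands out.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    # Score each word by how well its key matches the query.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Hypothetical 2-dimensional vectors, for illustration only.
words = ["the", "man", "is", "from", "America"]
keys = [[0.1, 0.0], [0.5, 0.2], [0.0, 0.1], [0.2, 0.1], [2.0, 1.5]]
values = keys
query = [1.0, 1.0]  # stands in for the blank in "He speaks ___"

weights, context = attention(query, keys, values)
for word, weight in zip(words, weights):
    print(f"{word:>8}: {weight:.3f}")
```

With these toy numbers, the largest weight lands on “America,” so the resulting context vector is dominated by that word rather than by an equal blend of all five.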
Another difference between Transformers and RNNs is the way each processes language. In the past, the solution was to process each word one by one, similar to how a human reads. Transformers, however, process all of the words at the same time, i.e. in parallel; this is called parallel computation. This drastically speeds up the processing time and makes it easier to train insanely large models on insanely large amounts of data.
Earlier language processing approaches proved to be inefficient and didn’t consider the order dependencies found in language (i.e. where a word occurs in a sentence affects its meaning). One such approach, the “Bag of Words” method, simply counts the number of times each word appears in a sequence. But to understand a sentence, you can’t just count words.
To humans, the two sentences in the figure above have drastically different meanings. But, in the Bag of Words method, they appear to be identical. The obvious next step was to find a solution that accounts for the order of words in addition to the frequency of words.
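A short Python sketch makes the limitation concrete. The two sentences here are hypothetical stand-ins, not the ones from the figure, and bag_of_words is an illustrative helper name rather than a function from any particular library.

```python
from collections import Counter


def bag_of_words(sentence):
    """Count how often each word appears, ignoring word order."""
    return Counter(sentence.lower().split())


sentence_a = "the dog bit the man"
sentence_b = "the man bit the dog"

# The word counts are identical even though the meanings are not.
print(bag_of_words(sentence_a) == bag_of_words(sentence_b))  # True
# Comparing the ordered word lists does tell the sentences apart.
print(sentence_a.split() == sentence_b.split())  # False
```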
Previous models passed in each word one by one, so there wasn’t an issue of knowing where the word was in the sentence. However, since Transformers pass the words in parallel, there had to be a new solution for providing context for words. When processing each word, the Transformer translates the word into something the computer can understand (a vector of numbers), then also adds a position reference onto the word. So, if there are multiple instances of the word “Omneky,” the position reference will let the computer know where each of those instances is within the text. This combination of word + position creates context for the computer.
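As one possible illustration of word + position, the Python sketch below uses the sine/cosine position scheme from the original Transformer paper; other models learn their position references instead. The four-number vector for “Omneky” is a made-up placeholder, since real word vectors are learned during training, and add_position is an illustrative helper name.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal position vector: sine on even indices, cosine on odd ones."""
    encoding = []
    for i in range(d_model):
        # Each pair of dimensions (2k, 2k+1) shares one wavelength.
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def add_position(word_vector, position):
    """Combine a word vector with its position reference (word + position)."""
    pe = positional_encoding(position, len(word_vector))
    return [w + p for w, p in zip(word_vector, pe)]


omneky = [0.2, -0.1, 0.4, 0.3]  # hypothetical word vector, for illustration only

first_mention = add_position(omneky, 0)
later_mention = add_position(omneky, 5)
print(first_mention)
print(later_mention)
```

The same word ends up with a different vector at position 0 than at position 5, which is how the model can tell the two mentions apart even though all words are passed in at once.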
Current Pretrained Language Models
Many models have built upon the initial Transformer laid out in Google’s “Attention is All You Need.” Every major company is training its own large language model. OpenAI has GPT/GPT-2/GPT-3, Google has BERT/ALBERT/XLNet/T5, Facebook has RoBERTa/XLM/BART, Microsoft has Turing-NLG, etc. As time goes on, companies are continuing to develop larger models. But, there is also an emphasis on attempting to create models that can run efficiently on commodity hardware and are accessible to the wider community.
You can play with open-source, pre-trained models with Hugging Face here:
There is a lot of hype surrounding OpenAI’s GPT-3 API. The API made OpenAI/Microsoft's gigantic Transformer model, trained on a vast amount of text from the web, accessible to regular developers. Users flocked to Twitter to showcase creative applications of the new model. For example, web developer Sharif Shameem (@sharifshameem) set up GPT-3 to produce HTML code and tweeted out his results. Given only prompts, GPT-3 produced web page layouts. The applications for creativity are endless.
Language Transformers and Omneky
Omneky’s goal is to utilize deep learning to level the “digital marketing playing field” between large and small companies. Richard Socher, Chief Scientist at Salesforce, states "Omneky is making transformers useful for all companies that want AI to help them with marketing." Using NLP and language transformers, Omneky drafts personalized ad copy that is guaranteed to drive conversations. Merging predictive analytics and text generation tools, Omneky’s software can help create personalized Facebook ad creatives with the click of a button. This allows Omneky to create and manage advertising campaigns at ¼ of the cost of traditional marketing companies. We are currently offering a two-week free trial of our service - just schedule a demo here!