
Visualizing Attention in Transformers | Generative AI


And logging the results in an experiment tracking tool

13 min read


Neon orange, pink, and purple transformer model toy jumping mid-air and shooting at the camera in an action shot. Stylized image, comic-book like. Cover image for Comet ML’s article “Explainable AI: Visualizing Attention in Transformers”
Photo by Jeffery Ho on Unsplash, edited by author.

In this article we explore one of the most popular tools for visualizing the core distinguishing feature of transformer architectures: the attention mechanism. Keep reading to learn more about BertViz and how you can incorporate this attention visualization tool into your NLP and MLOps workflow with Comet.

Feel free to follow along with the full-code tutorial here, or, if you can't wait, check out the final project here.

Transformers have been described as the single most important technological development in NLP in recent years, but their inner workings remain largely opaque. This is a problem because, as we continue to make major machine learning advances, we can't always explain how or why, which can lead to issues like undetected model bias, model collapse, and other ethical and reproducibility problems. Especially as models are more frequently deployed in sensitive areas like healthcare, law, finance, and security, model explainability is critical.

Horizontal bar chart showing gender and race projections for different professions, as predicted by a Word2Vec model (pre-transformer)
Gender and race projections across professions, as calculated by Word2Vec. These learned biases could have a variety of negative consequences depending on the application of such a model. Image from Bias in NLP Embeddings by Simon Warchal.

BertViz is an open-source tool that visualizes the attention mechanism of transformer models at multiple scales, including the model level, attention-head level, and neuron level. But BertViz isn't new. In fact, early versions of BertViz have been around since as early as 2017.

So, why are we still talking about BertViz?

BertViz is an explainability tool in a field (NLP) that is otherwise notoriously opaque. And, despite its name, BertViz doesn't only work on BERT. The BertViz API supports many transformer language models, including the GPT family of models, T5, and most Hugging Face models.

BertViz visualization in the Comet UI for two different types of transformer models: an encoder-only distilbert transformer for question-answering and a decoder-only gpt-2 transformer for text generation
Despite its name, BertViz supports a wide variety of models. On the left, we visualize a question-answering task using an encoder-only model, and on the right, a text generation task using a decoder-only model. GIF by author.

As transformer architectures have increasingly dominated the machine learning landscape in recent years, they've also revived an old but important debate about interpretability and transparency in AI. So, while BertViz may not be new, its utility as an explainability tool in the AI space is more relevant now than ever.

To explain BertViz, it helps to have a basic understanding of transformers and self-attention. If you're already familiar with these concepts, feel free to skip ahead to the section where we start coding.

We won't go into the nitty-gritty details of transformers here, as that's a little beyond the scope of this article, but we will cover some of the fundamentals. I also encourage you to check out the additional resources at the end of the article.

So, how, exactly, does a computer "learn" natural language? In short, it can't, at least not directly. Computers can only understand and process numerical data, so the first step of NLP is to break sentences down into "tokens," which are assigned numerical values. The question driving NLP then becomes: "how can we accurately reduce language and communication processes to computations?"
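As a quick illustration of what that first step looks like in practice, here is a minimal sketch using a Hugging Face tokenizer (this snippet is not part of the tutorial's Colab code, and the model name is just an example):

```python
from transformers import AutoTokenizer

# Load a pretrained tokenizer (the model name here is just an example)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentence = "The animal didn't cross the street because it was too scared."

tokens = tokenizer.tokenize(sentence)                 # split the sentence into sub-word tokens
token_ids = tokenizer.convert_tokens_to_ids(tokens)   # map each token to a vocabulary ID

print(tokens)     # ['the', 'animal', 'didn', "'", 't', 'cross', 'the', 'street', ...]
print(token_ids)  # a list of integer vocabulary IDs the model can actually process
```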

Some of the first NLP models included feed-forward neural networks like the Multi-Layer Perceptron (MLP) and even CNNs, which are more popularly used today for computer vision. These models worked for some simple classification tasks (like sentiment analysis) but had a major drawback: their feed-forward nature meant that at each point in time, the network only saw one word as its input. Imagine trying to predict the word that follows "the" in a sentence. How many possibilities are there?

A visualization of the difficulty of next-word prediction for sequence models that don't "remember" any context
Without much context, next-word prediction can become extremely difficult. Graphic by author.

To solve this problem, Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs, as in Seq2Seq) allowed for feedback, or cycles. This meant that each computation was informed by the previous computation, allowing for more context.

This context was still limited, however. If the input sequence was very long, the model would tend to forget the beginning of the sequence by the time it got to the end. Also, their sequential nature didn't allow for parallelization, making them extremely inefficient. RNNs also suffered notoriously from exploding gradients.

Transformers are sequence models that abandon the sequential structure of RNNs and LSTMs and adopt a fully attention-based approach. Transformers were originally developed for text processing, and are central to virtually all state-of-the-art NLP neural networks today, but they can also be used with image, video, audio, or just about any other sequential data.

The key feature differentiating transformers from earlier NLP models was the attention mechanism, as popularized in the Attention Is All You Need paper. This allowed for parallelization, which meant faster training and optimized performance. Attention also allowed for much larger contexts than recurrence, meaning transformers could craft more coherent, relevant, and complex outputs.

The original transformer architecture, as visualized in the 2017 paper that made them famous, Attention Is All You Need.
The original transformer architecture, as visualized in the 2017 paper that made them famous, Attention Is All You Need.

Transformers are made up of encoders and decoders, and the tasks we can perform with them depend on whether we use either or both of these components. Some common transformer tasks for NLP include text classification, named entity recognition, question-answering, text summarization, fill-in-the-blank, next-word prediction, translation, and text generation.

Chart showing the three different types of transformers: encoder-only, decoder-only, and encoder-decoder models. Chart also lists tasks specific to each type of transformer, as well as examples and alternative names.
Transformers are made up of encoders and decoders, and the tasks we can perform with them depend on whether we use either or both of these components. Note that there are also "sequence-to-sequence" models that don't use transformers. Graphic by author.

You've probably heard of Large Language Models (LLMs) like ChatGPT or LLaMA. The transformer architecture is a fundamental building block of LLMs, which use self-supervised learning on massive amounts of unlabeled data. These models are also often referred to as "foundation models" because they tend to generalize well to a wide range of tasks, and in some cases are also available for more specific fine-tuning. BERT is an example of this class of model.

A graphic showing the relationship between transformer architectures, foundation models, and large language models. Graphic includes (as examples): ViT, BLOOM, BERT, Falcon, LLaMA, ChatGPT, and SAM (Segment Anything Model).
Not all LLMs or foundation models use transformers, but they usually do. Not all foundation models are LLMs, but they usually are. Not all transformers are LLMs or FMs. The important takeaway is that all transformer models use attention. Graphic by author.

That's a lot of information, but the important takeaway here is that the key differentiating feature of the transformer model (and by extension all transformer-based foundational LLMs) is the concept of self-attention, which we'll go over next.

Generally speaking, attention describes the ability of a model to focus on the important parts of a sentence (or image, or any other sequential input). It does this by assigning weights to input features based on their importance and their position in the sequence.

Remember that attention was the concept that improved on the performance of earlier NLP models (like RNNs and LSTMs) by lending itself to parallelization. But attention isn't just about optimization. It also plays a pivotal role in broadening the context a language model is able to consider while processing and producing language. This enables a model to produce contextually appropriate and coherent text over much longer sequences.

A graphic showing the BertViz representation of the sentence “the animal didn’t cross the street because it was too scared.” The last word, scared, was predicted by the GPT-2 model. The graphic shows that GPT-2 correlates “it” to the animal.
In this example, GPT-2 completed the input sequence with the word "scared." How did the model know what "it" was? By examining the attention heads, we learn the model associated "it" with "the animal" (instead of, for example, "the street"). Image by author.

If we break transformers down into a "communication" phase and a "computation" phase, attention would represent the "communication" phase. In another analogy, attention is a lot like a search-retrieval problem, where given a query, q, we want to find the set of keys, k, most similar to q and return the corresponding values, v.

  • Query: What are the things I'm looking for?
  • Key: What are the things that I have?
  • Value: What are the things that I will communicate?
A visualization of how to calculate attention for transformers
Visualization of the attention calculation. Image from Erik Storrs.
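As a rough sketch of the calculation shown above (a simplified, single-head version written from scratch, not code from this tutorial):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Similarity of each query to every key, scaled by sqrt(d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the keys turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted sum of the values
    return weights @ V

# Toy example: a sequence of 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```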

Self-attention refers to the fact that every node produces a key, query, and value from that individual node. Multi-headed attention is just self-attention applied multiple times in parallel with differently initialized weights. Cross-attention means that the queries are still produced from a given decoder node, but the keys and values are produced as a function of the nodes in the encoder.

This is an oversimplified summary of transformer architectures, and we've glossed over quite a few details (like positional encodings, segment encodings, and attention masks). For more information, check out the additional resources below.

Transformers are not inherently interpretable, but there have been many attempts to contribute post-hoc explainability tools to attention-based models.

Previous attempts to visualize attention were often overly complicated and didn't translate well to non-technical audiences. They could also vary greatly from project to project and use case to use case.

A compilation of some very confusing and complicated previous attempts to visualize attention
Previous attempts to visualize attention were not standardized and were often overly complex. Graphic compiled by the author from Interactive Visualization and Manipulation of Attention-based Neural Machine Translation (2017) and A Visual Debugging Tool for Sequence-to-Sequence Models (2018).

Some successful attempts to explain attention behavior included attention-matrix heat maps and bipartite graph representations, both of which are still used today. But these methods also have some major limitations.

A graphic showing some methods of visualizing transformer attention other than BertViz
The attention-matrix heatmap (left) shows us that the model is not translating word-for-word, but considering a larger context for word order. But it's missing a lot of the finer details of the attention mechanism.

BertViz ultimately gained popularity for its ability to illustrate low-level, granular details of self-attention while still remaining remarkably simple and intuitive to use.

GIF of BertViz Attention Head View, selecting transformer layer and attention format type, and selecting specific attention heads, as visualized in Comet ML
BertViz ultimately gained popularity for its ability to illustrate low-level, granular details of self-attention while still remaining remarkably simple and intuitive to use. GIF by author.

That's a nice, clean visualization. But what are we actually looking at?

BertViz visualizes the attention mechanism at multiple local scales: the neuron level, attention-head level, and model level. Below we break down what that means, starting from the lowest, most granular level and working our way up.

A graphic showing the model view, attention head view, and neuron view of a transformer model using BertViz
BertViz visualizes attention at multiple scales, including the model level, attention-head level, and neuron level. Graphic by author.
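Outside of Comet, BertViz can also be called directly in a notebook. The sketch below is not part of this tutorial's Colab code, and the model name is just an example, but it shows roughly how the head and model views are produced from a Hugging Face model's attention weights:

```python
from transformers import AutoTokenizer, AutoModel
from bertviz import head_view, model_view

model_name = "distilbert-base-uncased"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

inputs = tokenizer("The animal didn't cross the street because it was too scared.",
                   return_tensors="pt")
outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
attention = outputs.attentions  # one tensor per layer: (batch, heads, seq, seq)

head_view(attention, tokens)    # attention-head view, one layer at a time
model_view(attention, tokens)   # model view, all layers and heads at once
```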

We'll log our BertViz plots to Comet, an experiment tracking tool, so we can compare our results later on. To get started with Comet, create a free account here, grab your API key, and run the following code:

We can set up Comet in just three lines of code.
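Here is a minimal sketch of that setup (the project name is a placeholder; you'll be prompted for the API key you grabbed above if it isn't already configured):

```python
import comet_ml

comet_ml.init(project_name="visualizing-attention")  # prompts for your API key if it isn't set
experiment = comet_ml.Experiment()                   # the experiment we'll log attention assets to
```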

Visualizing attention in Comet will help us interpret our models' decisions by showing how they attend to different parts of the input. In this tutorial, we'll use these visualizations to compare and dissect the performance of several pre-trained LLMs. But these visualizations can also be used during fine-tuning for debugging purposes.

To add BertViz to your dashboard, navigate to Comet's public panels and select either 'Transformers Model Viewer' or 'Transformers Attention Head Viewer.'

GIF showing how to add transformer model view of BertViz visualization to Comet UI dashboard.
To add BertViz to your Comet dashboard, select it from the public panels and adjust the view to your liking. GIF by author.

We'll define some functions to parse our models' results and log the attention data to Comet. See the Colab tutorial for the full code used. Then, we'll run the following commands to start logging our data to Comet (a sketch of the shared pattern follows the examples below):

Text generation example

Question-answering example

Sentiment analysis example
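The exact helper functions live in the Colab notebook, but the general pattern they share looks roughly like the sketch below (shown for the text generation example with GPT-2; the asset name and JSON structure here are illustrative, not necessarily the notebook's exact format):

```python
import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Text generation example with GPT-2; the question-answering and sentiment
# examples follow the same pattern with a different model and task.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_attentions=True)

text = "The animal didn't cross the street because it was too"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
attention = [layer[0].tolist() for layer in outputs.attentions]  # per layer: (heads, seq, seq)

# `experiment` is the comet_ml.Experiment created during setup; log the tokens
# and attention weights as a JSON asset that the attention panels can read.
experiment.log_asset_data(
    json.dumps({"tokens": tokens, "attention": attention}),
    name="gpt2_attention.json",
)
```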

At the lowest level, BertViz visualizes the query, key, and value embeddings used to compute attention in a neuron. Given a specific token, this view traces the computation of attention from that token to the other tokens in the sequence.

In the GIF below, positive values are colored blue and negative values are colored orange, with color intensity reflecting the magnitude of the value. Connecting lines are weighted based on the attention score between the respective words.

A short GIF demonstrating how to use BertViz to visualize the computations on a neuron-level of the attention layer for our transformer experiment in Comet ML.
The neuron view breaks down the calculations used to predict each token, including the key and query weights.

Whereas the views in the following two sections show what attention patterns the model learns, the neuron view shows how those patterns are learned. The neuron view is a bit more granular than we need for this particular tutorial, but for a deeper dive, we could use this view to link neurons to specific attention patterns and, more generally, to model behavior.

It's important to note that it isn't entirely clear what relationships exist between attention weights and model outputs. Some, like Jain et al. in Attention Is Not Explanation, argue that standard attention modules should not be treated as if they provide meaningful explanations for predictions. They propose no alternatives, however, and BertViz remains one of the most popular attention visualization tools today.

The attention-head view shows how attention flows between tokens within the same transformer layer by uncovering patterns between attention heads. In this view, the tokens on the left are attending to the tokens on the right, and attention is represented as a line connecting each token pair. Colors correspond to attention heads and line thickness represents the attention weight.

In the drop-down menu, we can select the experiment we'd like to visualize, and if we logged more than one asset to our experiment, we can also select our asset. We can then choose which attention layer we'd like to visualize and, optionally, any combination of attention heads we'd like to see. Note that the color intensity of the lines connecting tokens corresponds to the attention weights between tokens.

BertViz interactive visualization, as plotted within the Comet UI. Select experiment, asset, transformer model layer, and attention format.
Users have the option to specify the experiment, asset, layer, and attention format within the Comet UI.

We can also specify how we'd like our tokens to be formatted. For the question-answering example below, we'll select "Sentence A → Sentence B" so we can examine the attention between question and answer:

A BertViz visualization of attention with different sentence structure comparisons
Three different ways to visualize the attention output of BertViz. Graphic by author.

Attention head patterns

Attention heads don't share parameters, so each head learns a unique attention mechanism. In the graphic below, attention heads are examined across layers of the same model given one input. We can see that different attention heads seem to focus on very distinct patterns.

On the top left, attention is strongest between identical words (note the crossover where the two instances of "the" intersect). In the top center, there is a focus on the next word in the sentence. On the top right and bottom left, the attention heads are focusing on each of the delimiters ([SEP] and [CLS], respectively). The bottom center places emphasis on the comma. And the bottom right is almost a bag-of-words pattern.

BertViz shows that transformer attention captures various patterns in language, including positional patterns, delimiter patterns, and bag-of-words.
BertViz shows that attention captures various patterns in language, including positional patterns, delimiter patterns, and bag-of-words. Image by author.

Attention heads also capture lexical patterns. In the following graphic, we can see examples of attention heads that focus on list items (left), verbs (center), and acronyms (right).

BertViz shows transformer attention heads capture lexical patterns like list items, verbs, and acronyms.
BertViz shows attention heads capture lexical patterns like list items, verbs, and acronyms. Image by author.

Attention head biases

One application of the head view is detecting model bias. If we provide our model (in this case GPT-2) with two inputs that are identical apart from the final pronouns, we get very different generated outputs:

BertViz can help capture model bias in transformer attention mechanisms
On the left, the model assumes "she" is the nurse. On the right, it assumes "he" is the doctor asking the question. Once we've detected model bias, how might we augment our training data to counteract it? Image by author.

The model is assuming that "he" refers to the doctor, and "she" to the nurse, which might suggest that the co-reference mechanism is encoding gender bias. We would hope that by identifying a source of bias, we can potentially work to counteract it (perhaps with additional training data).
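As a quick sketch of how one might reproduce this kind of probe (using the Hugging Face text-generation pipeline rather than this tutorial's own helper code; the prompts are illustrative):

```python
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="gpt2")
set_seed(42)  # make the comparison repeatable

# Two prompts that are identical apart from the final pronoun
prompts = [
    "The doctor asked the nurse a question. She",
    "The doctor asked the nurse a question. He",
]

for prompt in prompts:
    result = generator(prompt, max_new_tokens=10, num_return_sequences=1)
    print(result[0]["generated_text"])
```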

The model view is a bird's-eye view of attention across all layers and heads. Here we can explore attention patterns across layers, illustrating the evolution of attention patterns from input to output. Each row of figures represents an attention layer and each column represents individual attention heads. To enlarge the figure for any particular head, we can simply click on it. Note that you can find the same line patterns in the model view as in the head view.

A GIF showing how to enlarge the attention head view in the Comet UI using the model view.
To enlarge an attention head in the model view, simply click on it. Notice how the attention pattern evolves across layers. Image by author.

Model view applications

So, how might we use the model view? Firstly, because each layer is initialized with separate, independent weights, the layers that focus on specific patterns for one sentence may focus on different patterns for another sentence. So we can't necessarily look at the same attention heads for the same patterns across experiment runs. With the model view we can more generally identify which layers may be focusing on areas of interest for a given sentence. Note that this is a very inexact science and, as many have noted, "if you look for it, you will find it." Nonetheless, this view does give us some interesting insight into what the model might be focusing on.

In the image below, we use the same example from earlier in the tutorial (left). On the right, a slightly different version of the sentence. In both cases, GPT-2 generated the last word in the sentence. At first, it might seem silly to think the dog had too many plans to go to the park. But examining the attention heads shows us the model was probably referring to the "park" as "too busy."

BertViz helps unravel how a transformer understands language.
On the left, GPT-2 likely refers to "the animal" when ending the sentence with "scared." On the right, it likely refers to "the park" when it finishes the sentence with "busy." Image by author.
A horizontal bar chart showing gender discrepancies in Amazon’s hiring practices
In 2018, Amazon scrapped a job applicant recommender system it had spent four years building, after realizing the model exhibited significant gender bias. The model had learned existing gender discrepancies in hiring practices and learned to perpetuate them. Image from Reuters.

As AI becomes more advanced, model calculations can become nearly impossible to interpret, even by the engineers and researchers who create them. This can lead to a whole host of unintended consequences, including, but not limited to: perpetuation of bias and stereotypes, mistrust in organizational decision-making, and even legal ramifications. Explainable Artificial Intelligence (XAI) is a set of processes used to describe a model's expected impact and potential biases. A commitment to XAI helps:

  • Organizations adopt a responsible approach to AI development
  • Developers ensure a model is working as expected and meets regulatory requirements
  • Researchers characterize accuracy, fairness, and transparency for decision-making
  • Organizations build trust and confidence

So how can practitioners incorporate XAI practices into their workflows when the most popular ML architectures today, transformers, are notoriously opaque? The answer to this question isn't simple, and explainability must be approached from many different angles. But we hope this tutorial gives you one more tool in your XAI toolbox by helping you visualize attention in transformers.

Thanks for making it all the way to the end, and we hope you enjoyed this article. Feel free to connect with us on our Community Slack channel with any questions, comments, or suggestions!

