
Visualizing Attention in Transformers | Generative AI


And logging the results in an experiment tracking tool

13 min read


Neon orange, pink, and purple transformer model toy jumping mid-air and shooting at the camera in an action shot. Stylized image, comic-book like. Cover image for Comet ML’s article “Explainable AI: Visualizing Attention in Transformers”
Photo by Jeffery Ho on Unsplash, edited by author.

In this article we explore one of the most popular tools for visualizing the core distinguishing feature of transformer architectures: the attention mechanism. Keep reading to learn more about BertViz and how you can incorporate this attention visualization tool into your NLP and MLOps workflow with Comet.

Feel free to follow along with the full-code tutorial here, or, if you can't wait, check out the final project here.

Transformers have been described as the single most important technological development in NLP in recent years, but their inner workings remain largely opaque. This is a problem because, as we continue to make major machine learning advances, we can't always explain how or why, which can lead to issues like undetected model bias, model collapse, and other ethical and reproducibility problems. Especially as models are more frequently deployed in sensitive areas like healthcare, law, finance, and security, model explainability is critical.

Horizontal bar chart showing gender and race projections for different professions, as predicted by a Word2Vec model (pre-transformer)
Gender and race projections across professions, as calculated by Word2Vec. These learned biases could have a variety of negative consequences depending on the application of such a model. Image from Bias in NLP Embeddings by Simon Warchal.

BertViz is an open-source tool that visualizes the attention mechanism of transformer models at multiple scales, including the model level, attention-head level, and neuron level. But BertViz isn't new. In fact, early versions of BertViz have been around since as early as 2017.

So, why are we still talking about BertViz?

BertViz is an explainability tool in a field (NLP) that is otherwise notoriously opaque. And, despite its name, BertViz doesn't only work on BERT. The BertViz API supports many transformer language models, including the GPT family of models, T5, and most Hugging Face models.

BertViz visualization in the Comet UI for two different types of transformer models: an encoder-only distilbert transformer for question-answering and a decoder-only gpt-2 transformer for text generation
Despite its name, BertViz supports a wide variety of models. On the left, we visualize a question-answering task using an encoder-only model, and on the right, a text generation task using a decoder-only model. GIF by author.

As transformer architectures have increasingly dominated the machine learning landscape in recent years, they've also revived an old but important debate about interpretability and transparency in AI. So, while BertViz may not be new, its utility as an explainability tool in the AI space is more relevant now than ever.

To explain BertViz, it helps to have a basic understanding of transformers and self-attention. If you're already familiar with these concepts, feel free to skip ahead to the section where we start coding.

We won't go into the nitty-gritty details of transformers here, as that's a little beyond the scope of this article, but we will cover some of the fundamentals. I also encourage you to check out the additional resources at the end of the article.

So, how, exactly, does a computer "learn" natural language? In short, it can't, at least not directly. Computers can only understand and process numerical data, so the first step of NLP is to break sentences down into "tokens," which are assigned numerical values. The question driving NLP then becomes: "how can we accurately reduce language and communication processes to computations?"
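As a quick illustration of what that first step looks like in practice, here is a minimal sketch using a Hugging Face tokenizer (this snippet is not part of the tutorial's Colab code, and the model name is just an example):

```python
from transformers import AutoTokenizer

# Load a pretrained tokenizer (the model name here is just an example)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentence = "The animal didn't cross the street because it was too scared."

tokens = tokenizer.tokenize(sentence)                 # split the sentence into sub-word tokens
token_ids = tokenizer.convert_tokens_to_ids(tokens)   # map each token to a vocabulary ID

print(tokens)     # ['the', 'animal', 'didn', "'", 't', 'cross', 'the', 'street', ...]
print(token_ids)  # a list of integer vocabulary IDs the model can actually process
```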

Some of the first NLP models included feed-forward neural networks like the Multi-Layer Perceptron (MLP) and even CNNs, which are more popularly used today for computer vision. These models worked for some simple classification tasks (like sentiment analysis) but had a major drawback: their feed-forward nature meant that at each point in time, the network only saw one word as its input. Imagine trying to predict the word that follows "the" in a sentence. How many possibilities are there?

A visualization of the difficulty of next-word prediction for sequence models that don't "remember" any context
Without much context, next-word prediction can become extremely difficult. Graphic by author.

To solve this problem, Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs, as in Seq2Seq) allowed for feedback, or cycles. This meant that each computation was informed by the previous computation, allowing for more context.

This context was still limited, however. If the input sequence was very long, the model would tend to forget the beginning of the sequence by the time it got to the end. Also, their sequential nature didn't allow for parallelization, making them extremely inefficient. RNNs also suffered notoriously from exploding gradients.

Transformers are sequence models that abandon the sequential structure of RNNs and LSTMs and adopt a fully attention-based approach. Transformers were originally developed for text processing, and are central to virtually all state-of-the-art NLP neural networks today, but they can also be used with image, video, audio, or just about any other sequential data.

The key feature differentiating transformers from earlier NLP models was the attention mechanism, as popularized in the Attention Is All You Need paper. This allowed for parallelization, which meant faster training and optimized performance. Attention also allowed for much larger contexts than recurrence, meaning transformers could craft more coherent, relevant, and complex outputs.

The original transformer architecture, as visualized in the 2017 paper that made them famous, Attention Is All You Need.
The original transformer architecture, as visualized in the 2017 paper that made them famous, Attention Is All You Need.

Transformers are made up of encoders and decoders, and the tasks we can perform with them depend on whether we use either or both of these components. Some common transformer tasks for NLP include text classification, named entity recognition, question-answering, text summarization, fill-in-the-blank, next-word prediction, translation, and text generation.

Chart showing the three different types of transformers: encoder-only, decoder-only, and encoder-decoder models. Chart also lists tasks specific to each type of transformer, as well as examples and alternative names.
Transformers are made up of encoders and decoders, and the tasks we can perform with them depend on whether we use either or both of these components. Note that there are also "sequence-to-sequence" models that don't use transformers. Graphic by author.

You've probably heard of Large Language Models (LLMs) like ChatGPT or LLaMA. The transformer architecture is a fundamental building block of LLMs, which use self-supervised learning on massive amounts of unlabeled data. These models are also often referred to as "foundation models" because they tend to generalize well to a wide range of tasks, and in some cases are also available for more specific fine-tuning. BERT is an example of this class of model.

A graphic showing the relationship between transformer architectures, foundation models, and large language models. Graphic includes (as examples): ViT, BLOOM, BERT, Falcon, LLaMA, ChatGPT, and SAM (Segment Anything Model).
Not all LLMs or foundation models use transformers, but they usually do. Not all foundation models are LLMs, but they usually are. Not all transformers are LLMs or FMs. The important takeaway is that all transformer models use attention. Graphic by author.

That's a lot of information, but the important takeaway here is that the key differentiating feature of the transformer model (and by extension all transformer-based foundational LLMs) is the concept of self-attention, which we'll go over next.

Generally speaking, attention describes the ability of a model to focus on the important parts of a sentence (or image, or any other sequential input). It does this by assigning weights to input features based on their importance and their position in the sequence.

Remember that attention was the concept that improved on the performance of earlier NLP models (like RNNs and LSTMs) by lending itself to parallelization. But attention isn't just about optimization. It also plays a pivotal role in broadening the context a language model is able to consider while processing and producing language. This enables a model to produce contextually appropriate and coherent text over much longer sequences.

A graphic showing the BertViz representation of the sentence “the animal didn’t cross the street because it was too scared.” The last word, scared, was predicted by the GPT-2 model. The graphic shows that GPT-2 correlates “it” to the animal.
In this example, GPT-2 completed the input sequence with the word "scared." How did the model know what "it" was? By examining the attention heads, we learn the model associated "it" with "the animal" (instead of, for example, "the street"). Image by author.

If we break transformers down into a "communication" phase and a "computation" phase, attention would represent the "communication" phase. In another analogy, attention is a lot like a search-retrieval problem, where given a query, q, we want to find the set of keys, k, most similar to q and return the corresponding values, v.

  • Query: What are the things I'm looking for?
  • Key: What are the things that I have?
  • Value: What are the things that I will communicate?
A visualization of how to calculate attention for transformers
Visualization of the attention calculation. Image from Erik Storrs.
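As a rough sketch of the calculation shown above (a simplified, single-head version written from scratch, not code from this tutorial):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Similarity of each query to every key, scaled by sqrt(d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the keys turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted sum of the values
    return weights @ V

# Toy example: a sequence of 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```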

Self-attention refers to the fact that every node produces a key, query, and value from that individual node. Multi-headed attention is just self-attention applied multiple times in parallel with differently initialized weights. Cross-attention means that the queries are still produced from a given decoder node, but the keys and values are produced as a function of the nodes in the encoder.

This is an oversimplified summary of transformer architectures, and we've glossed over quite a few details (like positional encodings, segment encodings, and attention masks). For more information, check out the additional resources below.

Transformers are not inherently interpretable, but there have been many attempts to contribute post-hoc explainability tools to attention-based models.

Previous attempts to visualize attention were often overly complicated and didn't translate well to non-technical audiences. They could also vary greatly from project to project and use case to use case.

A compilation of some very confusing and complicated previous attempts to visualize attention
Previous attempts to visualize attention were not standardized and were often overly complex. Graphic compiled by the author from Interactive Visualization and Manipulation of Attention-based Neural Machine Translation (2017) and A Visual Debugging Tool for Sequence-to-Sequence Models (2018).

Some successful attempts to explain attention behavior included attention-matrix heat maps and bipartite graph representations, both of which are still used today. But these methods also have some major limitations.

A graphic showing some methods of visualizing transformer attention other than BertViz
The attention-matrix heatmap (left) shows us that the model is not translating word-for-word, but considering a larger context for word order. But it's missing a lot of the finer details of the attention mechanism.

BertViz ultimately gained popularity for its ability to illustrate low-level, granular details of self-attention while still remaining remarkably simple and intuitive to use.

GIF of BertViz Attention Head View, selecting transformer layer and attention format type, and selecting specific attention heads, as visualized in Comet ML
BertViz ultimately gained popularity for its ability to illustrate low-level, granular details of self-attention while still remaining remarkably simple and intuitive to use. GIF by author.

That's a nice, clean visualization. But what are we actually looking at?

BertViz visualizes the attention mechanism at multiple local scales: the neuron level, attention-head level, and model level. Below we break down what that means, starting from the lowest, most granular level and working our way up.

A graphic showing the model view, attention head view, and neuron view of a transformer model using BertViz
BertViz visualizes attention at multiple scales, including the model level, attention-head level, and neuron level. Graphic by author.
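Outside of Comet, BertViz can also be called directly in a notebook. The sketch below is not part of this tutorial's Colab code, and the model name is just an example, but it shows roughly how the head and model views are produced from a Hugging Face model's attention weights:

```python
from transformers import AutoTokenizer, AutoModel
from bertviz import head_view, model_view

model_name = "distilbert-base-uncased"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

inputs = tokenizer("The animal didn't cross the street because it was too scared.",
                   return_tensors="pt")
outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
attention = outputs.attentions  # one tensor per layer: (batch, heads, seq, seq)

head_view(attention, tokens)    # attention-head view, one layer at a time
model_view(attention, tokens)   # model view, all layers and heads at once
```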

We'll log our BertViz plots to Comet, an experiment tracking tool, so we can compare our results later on. To get started with Comet, create a free account here, grab your API key, and run the following code:

We can set up Comet in just three lines of code.
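Here is a minimal sketch of that setup (the project name is a placeholder; you'll be prompted for the API key you grabbed above if it isn't already configured):

```python
import comet_ml

comet_ml.init(project_name="visualizing-attention")  # prompts for your API key if it isn't set
experiment = comet_ml.Experiment()                   # the experiment we'll log attention assets to
```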

Visualizing attention in Comet will help us interpret our models' decisions by showing how they attend to different parts of the input. In this tutorial, we'll use these visualizations to compare and dissect the performance of several pre-trained LLMs. But these visualizations can also be used during fine-tuning for debugging purposes.

To add BertViz to your dashboard, navigate to Comet's public panels and select either 'Transformers Model Viewer' or 'Transformers Attention Head Viewer.'

GIF showing how to add transformer model view of BertViz visualization to Comet UI dashboard.
To add BertViz to your Comet dashboard, select it from the public panels and adjust the view to your liking. GIF by author.

We'll define some functions to parse our models' results and log the attention data to Comet. See the Colab tutorial for the full code used. Then, we'll run the following commands to start logging our data to Comet (a sketch of the shared pattern follows the examples below):

Text generation example

Question-answering example

Sentiment analysis example
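The exact helper functions live in the Colab notebook, but the general pattern they share looks roughly like the sketch below (shown for the text generation example with GPT-2; the asset name and JSON structure here are illustrative, not necessarily the notebook's exact format):

```python
import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Text generation example with GPT-2; the question-answering and sentiment
# examples follow the same pattern with a different model and task.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_attentions=True)

text = "The animal didn't cross the street because it was too"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
attention = [layer[0].tolist() for layer in outputs.attentions]  # per layer: (heads, seq, seq)

# `experiment` is the comet_ml.Experiment created during setup; log the tokens
# and attention weights as a JSON asset that the attention panels can read.
experiment.log_asset_data(
    json.dumps({"tokens": tokens, "attention": attention}),
    name="gpt2_attention.json",
)
```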

At the lowest level, BertViz visualizes the query, key, and value embeddings used to compute attention in a neuron. Given a specific token, this view traces the computation of attention from that token to the other tokens in the sequence.

In the GIF below, positive values are colored blue and negative values are colored orange, with color intensity reflecting the magnitude of the value. Connecting lines are weighted based on the attention score between the respective words.

A short GIF demonstrating how to use BertViz to visualize the computations on a neuron-level of the attention layer for our transformer experiment in Comet ML.
The neuron view breaks down the calculations used to predict each token, including the key and query weights.

Whereas the views in the following two sections show what attention patterns the model learns, the neuron view shows how those patterns are learned. The neuron view is a bit more granular than we need for this particular tutorial, but for a deeper dive, we could use this view to link neurons to specific attention patterns and, more generally, to model behavior.

It's important to note that it isn't entirely clear what relationships exist between attention weights and model outputs. Some, like Jain et al. in Attention Is Not Explanation, argue that standard attention modules should not be treated as if they provide meaningful explanations for predictions. They propose no alternatives, however, and BertViz remains one of the most popular attention visualization tools today.

The attention-head view shows how attention flows between tokens within the same transformer layer by uncovering patterns between attention heads. In this view, the tokens on the left are attending to the tokens on the right, and attention is represented as a line connecting each token pair. Colors correspond to attention heads and line thickness represents the attention weight.

In the drop-down menu, we can select the experiment we'd like to visualize, and if we logged more than one asset to our experiment, we can also select our asset. We can then choose which attention layer we'd like to visualize and, optionally, any combination of attention heads we'd like to see. Note that the color intensity of the lines connecting tokens corresponds to the attention weights between tokens.

BertViz interactive visualization, as plotted within the Comet UI. Select experiment, asset, transformer model layer, and attention format.
Users have the option to specify the experiment, asset, layer, and attention format within the Comet UI.

We can also specify how we'd like our tokens to be formatted. For the question-answering example below, we'll select "Sentence A → Sentence B" so we can examine the attention between question and answer:

A BertViz visualization of attention with different sentence structure comparisons
Three different ways to visualize the attention output of BertViz. Graphic by author.

Attention head patterns

Attention heads don't share parameters, so each head learns a unique attention mechanism. In the graphic below, attention heads are examined across layers of the same model given one input. We can see that different attention heads seem to focus on very distinct patterns.

On the top left, attention is strongest between identical words (note the crossover where the two instances of "the" intersect). In the top center, there is a focus on the next word in the sentence. On the top right and bottom left, the attention heads are focusing on each of the delimiters ([SEP] and [CLS], respectively). The bottom center places emphasis on the comma. And the bottom right is almost a bag-of-words pattern.

BertViz shows that transformer attention captures various patterns in language, including positional patterns, delimiter patterns, and bag-of-words.
BertViz shows that attention captures various patterns in language, including positional patterns, delimiter patterns, and bag-of-words. Image by author.

Attention heads also capture lexical patterns. In the following graphic, we can see examples of attention heads that focus on list items (left), verbs (center), and acronyms (right).

BertViz shows transformer attention heads capture lexical patterns like list items, verbs, and acronyms.
BertViz shows attention heads capture lexical patterns like list items, verbs, and acronyms. Image by author.

Attention head biases

One application of the head view is detecting model bias. If we provide our model (in this case GPT-2) with two inputs that are identical apart from the final pronouns, we get very different generated outputs:

BertViz can help capture model bias in transformer attention mechanisms
On the left, the model assumes "she" is the nurse. On the right, it assumes "he" is the doctor asking the question. Once we've detected model bias, how might we augment our training data to counteract it? Image by author.

The model is assuming that "he" refers to the doctor, and "she" to the nurse, which might suggest that the co-reference mechanism is encoding gender bias. We would hope that by identifying a source of bias, we can potentially work to counteract it (perhaps with additional training data).
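As a quick sketch of how one might reproduce this kind of probe (using the Hugging Face text-generation pipeline rather than this tutorial's own helper code; the prompts are illustrative):

```python
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="gpt2")
set_seed(42)  # make the comparison repeatable

# Two prompts that are identical apart from the final pronoun
prompts = [
    "The doctor asked the nurse a question. She",
    "The doctor asked the nurse a question. He",
]

for prompt in prompts:
    result = generator(prompt, max_new_tokens=10, num_return_sequences=1)
    print(result[0]["generated_text"])
```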

The model view is a bird's-eye view of attention across all layers and heads. Here we can explore attention patterns across layers, illustrating the evolution of attention patterns from input to output. Each row of figures represents an attention layer and each column represents individual attention heads. To enlarge the figure for any particular head, we can simply click on it. Note that you can find the same line patterns in the model view as in the head view.

A GIF showing how to enlarge the attention head view in the Comet UI using the model view.
To enlarge an attention head in the model view, simply click on it. Notice how the attention pattern evolves across layers. Image by author.

Model view applications

So, how might we use the model view? Firstly, because each layer is initialized with separate, independent weights, the layers that focus on specific patterns for one sentence may focus on different patterns for another sentence. So we can't necessarily look at the same attention heads for the same patterns across experiment runs. With the model view we can more generally identify which layers may be focusing on areas of interest for a given sentence. Note that this is a very inexact science and, as many have noted, "if you look for it, you will find it." Nonetheless, this view does give us some interesting insight into what the model might be focusing on.

In the image below, we use the same example from earlier in the tutorial (left). On the right, a slightly different version of the sentence. In both cases, GPT-2 generated the last word in the sentence. At first, it might seem silly to think the dog had too many plans to go to the park. But examining the attention heads shows us the model was probably referring to the "park" as "too busy."

BertViz helps unravel how a transformer understands language.
On the left, GPT-2 likely refers to "the animal" when ending the sentence with "scared." On the right, it likely refers to "the park" when it finishes the sentence with "busy." Image by author.
A horizontal bar chart showing gender discrepancies in Amazon’s hiring practices
In 2018, Amazon scrapped a job applicant recommender system it had spent four years building, after realizing the model exhibited significant gender bias. The model had learned existing gender discrepancies in hiring practices and learned to perpetuate them. Image from Reuters.

As AI becomes more advanced, model calculations can become nearly impossible to interpret, even by the engineers and researchers who create them. This can lead to a whole host of unintended consequences, including, but not limited to: perpetuation of bias and stereotypes, mistrust in organizational decision-making, and even legal ramifications. Explainable Artificial Intelligence (XAI) is a set of processes used to describe a model's expected impact and potential biases. A commitment to XAI helps:

  • Organizations adopt a responsible approach to AI development
  • Developers ensure a model is working as expected and meets regulatory requirements
  • Researchers characterize accuracy, fairness, and transparency for decision-making
  • Organizations build trust and confidence

So how can practitioners incorporate XAI practices into their workflows when the most popular ML architectures today, transformers, are notoriously opaque? The answer to this question isn't simple, and explainability must be approached from many different angles. But we hope this tutorial gives you one more tool in your XAI toolbox by helping you visualize attention in transformers.

Thanks for making it all the way to the end, and we hope you enjoyed this article. Feel free to connect with us on our Community Slack channel with any questions, comments, or suggestions!

