One-Hot Encoding, Dot Product, and Matrix Multiplication: The Basics of Transformers


Introduction

In the world of natural language processing (NLP), everything begins with words. However, computers don’t understand words directly – they need numbers. Our first task is to convert words into numerical representations so that we can perform mathematical operations on them. This is especially important when building systems like voice-activated assistants, where we need to transform sequences of sounds into sequences of words.

To achieve this, we start by defining a Vocabulary, which is the set of symbols (or words) we’ll be working with. For simplicity, let’s assume we’re working with English, which has tens of thousands of words, plus additional terms specific to technology or other domains. This could result in a vocabulary size of nearly a hundred thousand words.

One straightforward way to convert words into numbers is to assign each word a unique integer. For example, in a small vocabulary consisting of the words [basics, of, transformers], we could assign basics = 1, transformers = 2, and of = 3.

Then, a sentence like “Basics of transformers” would be represented as the sequence [1, 3, 2].

This is called label encoding. While this method is valid, it has a key drawback: the integers imply a false magnitude and ordering (e.g., 1 < 2 < 3), even though the words themselves have no such natural order.
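As a quick sketch, here is how label encoding might look in Python; the vocabulary and integer assignments are just the toy example above, not any standard mapping:

```python
# A minimal sketch of label encoding for the toy three-word vocabulary.
vocab = {"basics": 1, "transformers": 2, "of": 3}

def label_encode(sentence):
    """Map each (lowercased) word to its integer label."""
    return [vocab[word] for word in sentence.lower().split()]

print(label_encode("Basics of transformers"))  # [1, 3, 2]
```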

One-Hot Encoding: A Better Representation

There’s a way to represent words numerically that avoids this false ordering, known as one-hot encoding. In one-hot encoding, each word is represented as a vector (a one-dimensional array) with a length equal to the size of the vocabulary. The vector is mostly zeros, except for a single “1” at the position corresponding to the word’s index. For example, in our small vocabulary:

  • basics = [1, 0, 0]
  • transformers = [0, 1, 0]
  • of = [0, 0, 1]

Using this encoding, the sentence “Basics of transformers” becomes a sequence of vectors: [[1, 0, 0], [0, 0, 1], [0, 1, 0]]. When these vectors are stacked together, they form a two-dimensional array, or matrix.
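Here is a minimal NumPy sketch of one-hot encoding for the same toy vocabulary; the indices simply follow the vectors listed above:

```python
import numpy as np

# Index 0 = basics, 1 = transformers, 2 = of (matching the vectors above).
vocab = ["basics", "transformers", "of"]

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

# Stacking the word vectors of a sentence yields a 2-D matrix.
sentence = ["basics", "of", "transformers"]
matrix = np.stack([one_hot(w) for w in sentence])
print(matrix)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```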

Dot Product: Measuring Similarity

One of the key advantages of one-hot encoding is that it allows us to compute dot products (also referred to as inner products or scalar products). The dot product of two vectors is calculated by multiplying their corresponding elements and summing the results. For example, consider two vectors A = [a₁, a₂, a₃] and B = [b₁, b₂, b₃]. Their dot product is:

A · B = (a₁ × b₁) + (a₂ × b₂) + (a₃ × b₃)
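For concreteness, a short NumPy sketch with arbitrary example values, computing the dot product both element by element and with np.dot:

```python
import numpy as np

A = np.array([1.0, 2.0, 3.0])  # example values, chosen arbitrarily
B = np.array([4.0, 5.0, 6.0])

# Multiply corresponding elements and sum: (1*4) + (2*5) + (3*6) = 32.
manual = sum(a * b for a, b in zip(A, B))
print(manual, np.dot(A, B))  # 32.0 32.0
```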

When working with one-hot encoded vectors, the dot product has some useful properties:

  • The dot product of a one-hot vector with itself is always 1, because the single 1 in the vector is multiplied by itself, and all other elements are zeros.
  • The dot product of any two different one-hot vectors is always 0, since their 1’s never occupy the same position.

These properties make dot products particularly useful for measuring similarity between vectors: the first case indicates perfect similarity, while the second indicates orthogonality, i.e., no similarity.

For example, if we have a vector that represents a combination of words with different weights, we can use the dot product to determine how strongly a specific word is represented in that combination.
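A small NumPy sketch illustrating both properties, plus the weighted-combination lookup just described (the 0.7/0.3 weights are arbitrary illustrations):

```python
import numpy as np

basics       = np.array([1.0, 0.0, 0.0])
transformers = np.array([0.0, 1.0, 0.0])

# Property 1: a one-hot vector dotted with itself is 1.
print(np.dot(basics, basics))        # 1.0

# Property 2: two different one-hot vectors are orthogonal.
print(np.dot(basics, transformers))  # 0.0

# A weighted combination of words; dotting it with a one-hot
# vector extracts that word's weight from the combination.
combination = 0.7 * basics + 0.3 * transformers
print(np.dot(combination, basics))        # 0.7
print(np.dot(combination, transformers))  # 0.3
```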

Matrix Multiplication: Extending the Dot Product

The concept of the dot product is central to matrix multiplication, which is a fundamental operation in many machine learning models, including transformers. It provides a mechanism for combining a pair of two-dimensional arrays.

In the simplest case, where we have two matrices X and Y such that X has one row and Y has one column, the matrix product is the same as the dot product of that row with that column.

Matrix multiplication requires that the number of columns in the first matrix (X) match the number of rows in the second matrix (Y).

Now, as X and Y grow, the same approach still works: each entry of the product is the dot product of one row of X with one column of Y.

Observe how matrix multiplication can act as a lookup table. Suppose matrix X is made up of a stack of one-hot row vectors, with 1’s in the first and second columns respectively. Matrix multiplication then pulls out the first and second rows of the Y matrix, in that order.
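Here is a minimal NumPy sketch of this lookup behavior; the values in Y are arbitrary placeholders standing in for a table of data:

```python
import numpy as np

# X stacks two one-hot row vectors, with 1's in the first and
# second columns respectively.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

# Y is an arbitrary 3x2 matrix standing in for a lookup table.
Y = np.array([[10.0, 11.0],
              [20.0, 21.0],
              [30.0, 31.0]])

# X @ Y pulls out the first and second rows of Y, in that order.
print(X @ Y)
# [[10. 11.]
#  [20. 21.]]
```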

This idea serves as a tool to perform large-scale computations efficiently, especially when dealing with high-dimensional data like word embeddings in transformers.

Putting It All Together

This concept of using one-hot vectors to select specific rows from a matrix is at the heart of how transformers operate. Transformers rely heavily on matrix multiplications to process sequences of data, such as words in a sentence. By encoding information as one-hot vectors, transformers can efficiently retrieve and manipulate specific pieces of data from large matrices, enabling them to perform complex tasks like language translation, text generation, and more.

In the upcoming article, we will break down more essential concepts, eventually bringing everything together to explain the Transformer architecture.
