{"id":156,"date":"2025-02-28T12:18:50","date_gmt":"2025-02-28T12:18:50","guid":{"rendered":"https:\/\/ai.munishkaushik.com\/?p=156"},"modified":"2025-02-28T12:18:52","modified_gmt":"2025-02-28T12:18:52","slug":"one-hot-encoding-dot-product-and-matrices-multiplication-the-basics-of-transformers","status":"publish","type":"post","link":"https:\/\/ai.munishkaushik.com\/index.php\/2025\/02\/28\/one-hot-encoding-dot-product-and-matrices-multiplication-the-basics-of-transformers\/","title":{"rendered":"One-Hot Encoding, Dot Product and Matrix Multiplication: The Basics of Transformers"},"content":{"rendered":"\n<p id=\"ember603\"><strong>Introduction<\/strong><\/p>\n\n\n\n<p id=\"ember604\">In the world of natural language processing (NLP), everything begins with words. However, computers don\u2019t understand words directly &#8211; they need numbers. Our first task is to convert words into numerical representations so that we can perform mathematical operations on them. This is especially important when building systems like voice-activated assistants, where we need to transform sequences of sounds into sequences of words.<\/p>\n\n\n\n<p id=\"ember605\">To achieve this, we start by defining a <strong>Vocabulary<\/strong>, which is the set of symbols (or words) we\u2019ll be working with. For simplicity, let\u2019s assume we\u2019re working with English, which has tens of thousands of words, plus additional terms specific to technology or other domains. This could result in a vocabulary size of nearly a hundred thousand words.<\/p>\n\n\n\n<p id=\"ember606\">One straightforward way to convert words into numbers is to assign each word a unique integer. For example, in a small vocabulary consisting of the words [<em>basics<\/em>, <em>of<\/em>, <em>transformers<\/em>], 
we could assign <em>basics<\/em> = 1, <em>transformers<\/em> = 2, and <em>of<\/em> = 3.<\/p>\n\n\n\n<p id=\"ember607\">Then, a sentence like &#8220;Basics of transformers&#8221; would be represented as the sequence [1, 3, 2].<\/p>\n\n\n\n<p id=\"ember608\">This is called label encoding. While this method is valid, its main drawback is that it suggests a false sense of magnitude: the integers have a natural ordering (e.g., 1 &lt; 2 &lt; 3), even though no such ordering exists between the words themselves.<\/p>\n\n\n\n<p id=\"ember609\"><strong>One-Hot Encoding: Efficient Encoding<\/strong><\/p>\n\n\n\n<p id=\"ember610\">There\u2019s a better way to represent words numerically that avoids this ordering problem, known as <strong>one-hot encoding<\/strong>. In one-hot encoding, each word is represented as a vector (a one-dimensional array) with a length equal to the size of the vocabulary. The vector is all zeros, except for a single &#8220;1&#8221; at the position corresponding to the word\u2019s index. For example, in our small vocabulary:<\/p>\n\n\n\n<ul><li><em>basics<\/em> = [1, 0, 0]<\/li><li><em>transformers<\/em> = [0, 1, 0]<\/li><li><em>of<\/em> = [0, 0, 1]<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"https:\/\/media.licdn.com\/dms\/image\/v2\/D5612AQEbR8nSQnY4Ag\/article-inline_image-shrink_400_744\/B56ZVHUPqfGUAY-\/0\/1740658244724?e=1746057600&amp;v=beta&amp;t=NXkDyMtrKw6QdAU5HLRNu2lCqjt00BEFxkFE0_kpH08\" alt=\"\"\/><\/figure>\n\n\n\n<p id=\"ember615\">Using this encoding, the sentence &#8220;Basics of transformers&#8221; becomes a sequence of vectors: [[1, 0, 0], [0, 0, 1], [0, 1, 0]]. 
When these vectors are stacked together, they form a two-dimensional array, or matrix.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"https:\/\/media.licdn.com\/dms\/image\/v2\/D5612AQGLtVextS1tiA\/article-inline_image-shrink_1500_2232\/B56ZVHUPpUHEAY-\/0\/1740658244586?e=1746057600&amp;v=beta&amp;t=5VW8XEnYvN99HAgMM4xWwRsuoP-NPEtLyZTvVsboqho\" alt=\"\"\/><\/figure>\n\n\n\n<p id=\"ember617\"><strong>Dot Product: Measuring Similarity<\/strong><\/p>\n\n\n\n<p id=\"ember618\">One of the key advantages of one-hot encoding is that it allows us to compute <strong>dot products<\/strong> (also known as inner products or scalar products). The dot product of two vectors is calculated by multiplying their corresponding elements and summing the results. For example, consider two vectors <strong>A<\/strong> = [a\u2081, a\u2082, a\u2083] and <strong>B<\/strong> = [b\u2081, b\u2082, b\u2083]. Their dot product is:<\/p>\n\n\n\n<p id=\"ember619\"><strong>A \u00b7 B<\/strong> = (a\u2081 \u00d7 b\u2081) + (a\u2082 \u00d7 b\u2082) + (a\u2083 \u00d7 b\u2083)<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"https:\/\/media.licdn.com\/dms\/image\/v2\/D5612AQGXrGpF-le2Lw\/article-inline_image-shrink_1500_2232\/B56ZVHUPqZHEAU-\/0\/1740658244660?e=1746057600&amp;v=beta&amp;t=wwkRrrtFjT--t43qtL3AHc1SLBYEAoiFJs1QaLQqjK4\" alt=\"\"\/><\/figure>\n\n\n\n<p id=\"ember621\">When working with one-hot encoded vectors, the dot product has some useful properties.<\/p>\n\n\n\n<p id=\"ember622\">For instance,<\/p>\n\n\n\n<p id=\"ember623\">\u00b7&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; the dot product of a one-hot vector with itself is always 1, because the single 1 in the vector is multiplied by itself, and every other term in the sum is zero.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img 
src=\"https:\/\/media.licdn.com\/dms\/image\/v2\/D5612AQGqeoilEES7FA\/article-inline_image-shrink_1000_1488\/B56ZVHUPsAGoAQ-\/0\/1740658244781?e=1746057600&amp;v=beta&amp;t=z8HTQwmYoGHPcLE2jHHcfGIg8urHglyaP0zcDY3wowY\" alt=\"\"\/><\/figure>\n\n\n\n<p id=\"ember625\">\u00b7&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; the dot product of any two different one-hot vectors is always 0, since their 1s never fall in the same position.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"https:\/\/media.licdn.com\/dms\/image\/v2\/D5612AQEdqq7DWNc2WA\/article-inline_image-shrink_1500_2232\/B56ZVHUPp0GsAU-\/0\/1740658244646?e=1746057600&amp;v=beta&amp;t=NLyGUKV-K3PrXGWO3gZMIQvTWd_xhHFhqxTrgHUBnJM\" alt=\"\"\/><\/figure>\n\n\n\n<p id=\"ember627\">These properties make dot products particularly useful for measuring similarity between vectors. In the first case, the result indicates perfect similarity; in the second, it indicates orthogonality, that is, no similarity at all.<\/p>\n\n\n\n<p id=\"ember628\">For example, if we have a vector that represents a combination of words with different weights, we can take its dot product with a word\u2019s one-hot vector to determine how strongly that word is represented in the combination: the result is simply the weight stored at that word\u2019s position.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"https:\/\/media.licdn.com\/dms\/image\/v2\/D5612AQEFePDwu-CXDQ\/article-inline_image-shrink_1000_1488\/B56ZVHUPrMHoAQ-\/0\/1740658244743?e=1746057600&amp;v=beta&amp;t=VWelDy2tuuAoeSp0g0pVncqjDXWLxeI0MRJGmnUblWs\" alt=\"\"\/><\/figure>\n\n\n\n<p id=\"ember630\"><strong>Matrix Multiplication: Extending the Dot Product<\/strong><\/p>\n\n\n\n<p id=\"ember631\">The dot product is the building block of matrix multiplication, which is a fundamental operation in many machine learning models, including transformers. 
Matrix multiplication provides a way to combine a pair of two-dimensional arrays.<\/p>\n\n\n\n<p id=\"ember632\">In the simplest case, where we have two matrices X and Y such that X has one row and Y has one column, the matrix product is the same as the dot product.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"https:\/\/media.licdn.com\/dms\/image\/v2\/D5612AQFjgxLmhVDPUg\/article-inline_image-shrink_1500_2232\/B56ZVHUPqpGUAU-\/0\/1740658244726?e=1746057600&amp;v=beta&amp;t=QFzgvDsMy_g_Ivx2mdYnZWZHRcTsvZJtMIqrW8l67K4\" alt=\"\"\/><\/figure>\n\n\n\n<p id=\"ember634\">Matrix multiplication requires that the number of columns in the first matrix (<strong>X<\/strong>) match the number of rows in the second matrix (<strong>Y<\/strong>).<\/p>\n\n\n\n<p id=\"ember635\">As X and Y grow, the same approach applies: each entry of the result is the dot product of one row of X with one column of Y.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"https:\/\/media.licdn.com\/dms\/image\/v2\/D5612AQEvRVVHxRr_Fg\/article-inline_image-shrink_1000_1488\/B56ZVHUPrPHoAU-\/0\/1740658244761?e=1746057600&amp;v=beta&amp;t=Sy1RV8WLKkERpCtRPkJD9s725jvdcti_76MfeLIfbsM\" alt=\"\"\/><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"https:\/\/media.licdn.com\/dms\/image\/v2\/D5612AQFzyopRbfB6Rw\/article-inline_image-shrink_1500_2232\/B56ZVHUPqnHsAY-\/0\/1740658244737?e=1746057600&amp;v=beta&amp;t=RG2eyFaodsadeusk6nfK5Btynj1lxvjFPbuaWYNsm5M\" alt=\"\"\/><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"https:\/\/media.licdn.com\/dms\/image\/v2\/D5612AQEsKg3b9HKGfA\/article-inline_image-shrink_1500_2232\/B56ZVHUPrUHsAY-\/0\/1740658244807?e=1746057600&amp;v=beta&amp;t=Xyml_3T3O-PBapzvhPArdfhvDm52E0g-4AwGHqPnDss\" alt=\"\"\/><\/figure>\n\n\n\n<p id=\"ember641\">Observe how matrix multiplication acts as a lookup table. Matrix <strong>X<\/strong> is made up of a stack of one-hot vectors.<\/p>\n\n\n\n<p id=\"ember642\">These vectors have ones in the first and second columns, respectively. 
Multiplying <strong>X<\/strong> by <strong>Y<\/strong> therefore pulls out the first and second rows of <strong>Y<\/strong>, in that order.<\/p>\n\n\n\n<p id=\"ember643\">This lookup behavior makes it possible to perform large-scale computations efficiently, especially when dealing with high-dimensional data such as word embeddings in transformers.<\/p>\n\n\n\n<p id=\"ember644\"><strong>Summary<\/strong><\/p>\n\n\n\n<p id=\"ember645\">This concept of using one-hot vectors to select specific rows from a matrix is at the heart of how transformers operate. Transformers rely heavily on matrix multiplications to process sequences of data, such as words in a sentence. By encoding information as one-hot vectors, transformers can efficiently retrieve and manipulate specific pieces of data from large matrices, enabling them to perform complex tasks like language translation, text generation, and more.<\/p>\n\n\n\n<p id=\"ember646\">In the upcoming article, we will break down further essential concepts, working toward bringing them all together to explain the Transformer architecture.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction In the world of natural language processing (NLP), everything begins with words. However, computers don\u2019t understand words directly &#8211; they need numbers. Our first task is to convert words into numerical representations so that we can perform mathematical operations on them. 
This is especially important when building systems like voice-activated assistants, where we need [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":157,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[25,24,42],"tags":[43,44],"_links":{"self":[{"href":"https:\/\/ai.munishkaushik.com\/index.php\/wp-json\/wp\/v2\/posts\/156"}],"collection":[{"href":"https:\/\/ai.munishkaushik.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ai.munishkaushik.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ai.munishkaushik.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ai.munishkaushik.com\/index.php\/wp-json\/wp\/v2\/comments?post=156"}],"version-history":[{"count":1,"href":"https:\/\/ai.munishkaushik.com\/index.php\/wp-json\/wp\/v2\/posts\/156\/revisions"}],"predecessor-version":[{"id":158,"href":"https:\/\/ai.munishkaushik.com\/index.php\/wp-json\/wp\/v2\/posts\/156\/revisions\/158"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ai.munishkaushik.com\/index.php\/wp-json\/wp\/v2\/media\/157"}],"wp:attachment":[{"href":"https:\/\/ai.munishkaushik.com\/index.php\/wp-json\/wp\/v2\/media?parent=156"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ai.munishkaushik.com\/index.php\/wp-json\/wp\/v2\/categories?post=156"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ai.munishkaushik.com\/index.php\/wp-json\/wp\/v2\/tags?post=156"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}