How Large Language Models (LLMs) Understand Your Writing
Large language models (LLMs) are a type of artificial intelligence (AI) program trained on very large datasets (hence the name “large”), which enables them to recognize, understand, and generate natural language text, among other tasks.
Thanks to their significant contributions to the advancement of generative AI, LLMs have recently gained widespread recognition.
They have also become a focal point for organizations looking to integrate artificial intelligence into a variety of business operations and applications.
LLMs learn to understand and process human language and other complex data by being exposed to massive amounts of examples, often meaning thousands or millions of gigabytes of text from across the internet.
These models leverage deep learning to determine relationships between characters, words, and sentences by probabilistically analyzing unstructured data.
This allows them to identify different types of content autonomously, without the need for direct human guidance.
Whether it’s understanding questions, writing answers, classifying content, completing sentences, or even translating text into another language, these AI models can be adapted to solve specific problems in different industries.
Like super-readers in a giant library full of books, they absorb tons of information to learn how language works.
In this article, we’ll dive into the fascinating world of large language models and their inner workings.
Main characteristics of large language models
- Large dimensions
- General use
- Pre-trained, tuned and multimodal
Basic mechanics of large language models
At the heart of LLMs, we encounter the transformer model, which is crucial to understanding how these models work.
Transformers consist of an encoder and a decoder, which process data by breaking down the inputs into tokens.
They perform complex mathematical computations to analyze the relationships between these tokens and arrive at an output.
In essence, the encoder “encodes” the input sequence and passes it to the decoder, which learns to “decode” the representations for a relevant task.
Transformers allow a computer to pick up patterns in language in a way loosely comparable to human cognition.
These models rely on self-attention mechanisms, which allow them to learn faster than older models, such as long short-term memory (LSTM) models.
Self-attention lets the model process each word in a sequence while taking into account the context provided by the other words in the same sentence.
Input Encoding
The first step is to convert the input sentence into a series of word embeddings.
Each word is transformed into a vector that represents its semantic meaning in a high-dimensional space.
Word embedding efficiently captures the meaning of a word, ensuring that words positioned closely in the vector space share similar meanings.
Example: [Embedding for 'The', Embedding for 'cat', Embedding for 'sat', Embedding for 'on', Embedding for 'the', Embedding for 'mat', Embedding for '.']
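The encoding step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the whitespace tokenizer, the 3-dimensional vectors in `EMBEDDINGS`, and the extra word "dog" are all invented here, whereas real models learn much larger vectors and usually split text into subword tokens.

```python
import math

# Toy 3-dimensional embedding table; every number is invented for
# illustration. Real models learn vectors with hundreds or thousands of
# dimensions and usually work on subword tokens, not whole words.
EMBEDDINGS = {
    "the": [0.10, 0.30, 0.20],
    "cat": [0.90, 0.10, 0.40],
    "dog": [0.85, 0.15, 0.45],
    "sat": [0.20, 0.80, 0.10],
    "on":  [0.05, 0.40, 0.60],
    "mat": [0.60, 0.20, 0.70],
    ".":   [0.00, 0.05, 0.05],
}

def tokenize(sentence):
    """Split on whitespace and treat a trailing period as its own token."""
    tokens = []
    for word in sentence.lower().split():
        if word.endswith(".") and len(word) > 1:
            tokens.extend([word[:-1], "."])
        else:
            tokens.append(word)
    return tokens

def embed(tokens):
    """Look up the vector for each token."""
    return [EMBEDDINGS[t] for t in tokens]

def cosine_similarity(a, b):
    """Close to 1.0 means the vectors point in nearly the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

tokens = tokenize("The cat sat on the mat.")
vectors = embed(tokens)
print(tokens)
print(cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["dog"]))
print(cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["sat"]))
```

In this toy table, "cat" and "dog" were deliberately given similar vectors, so their cosine similarity comes out higher than that of "cat" and "sat", which mirrors the idea that nearby vectors share similar meanings.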
Generate queries, keys and values
Next, the self-attention mechanism produces three different versions of the input embeddings: queries, keys, and values.
These are created by linear transformations of the original embeddings and play a key role in computing attention scores.
Queries: [Query for 'The', Query for 'cat', Query for 'sat', Query for 'on', Query for 'the', Query for 'mat', Query for '.']
Keys: [Key for 'The', Key for 'cat', Key for 'sat', Key for 'on', Key for 'the', Key for 'mat', Key for '.']
Values: [Value for 'The', Value for 'cat', Value for 'sat', Value for 'on', Value for 'the', Value for 'mat', Value for '.']
Example with random data
Queries: [[0.23,0.4,0.67,...],[0.4,0.6,0.67,...],[0.2,0.2,0.67,...],[0.5,0.3,0.8,...], [0.1,0.4,0.67,...], [0.2,0.4,0.67,...],[0.7,0.4,0.6,...]]
Keys: [[0.1,0.4,0.5,...],[0.2,0.4,0.67,...],[0.3,0.4,0.67,...],[0.4,0.4,0.67,...], [0.5,0.4,0.67,...], [0.6,0.7,0.8,...],[0.6,0.4,0.8,...]]
Values: [[0.4,0.5,0.67,...],[0.23,0.4,0.5,...],[0.23,0.4,0.8,...],[0.23,0.4,0.45,...],[0.23,0.4,0.9,...],[0.23,0.4,0.6,...],[0.23,0.4,0.10,...]]
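As a rough sketch of the linear transformations mentioned above, the snippet below multiplies each embedding by three weight matrices to obtain queries, keys, and values. The matrices `W_Q`, `W_K`, and `W_V` and all numbers are invented for this example; in a real model they are learned during training and are far larger.

```python
# Toy setup: 3-dimensional embeddings projected to 2-dimensional queries,
# keys and values. All numbers are invented; in a real model the three
# weight matrices are learned during training.
embeddings = [
    [0.10, 0.30, 0.20],  # 'The'
    [0.90, 0.10, 0.40],  # 'cat'
    [0.20, 0.80, 0.10],  # 'sat'
]

W_Q = [[0.5, 0.1], [0.2, 0.7], [0.3, 0.4]]
W_K = [[0.6, 0.2], [0.1, 0.9], [0.5, 0.3]]
W_V = [[0.4, 0.8], [0.7, 0.2], [0.1, 0.6]]

def project(vectors, weights):
    """Multiply each row vector by a weight matrix (a linear transformation)."""
    n_out = len(weights[0])
    return [
        [sum(v[i] * weights[i][j] for i in range(len(v))) for j in range(n_out)]
        for v in vectors
    ]

queries = project(embeddings, W_Q)
keys = project(embeddings, W_K)
values = project(embeddings, W_V)

for name, q in zip(["The", "cat", "sat"], queries):
    print(name, [round(x, 3) for x in q])
```

The same embedding therefore yields three different vectors, one for each role it plays in the attention computation.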
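Determination of attention scores
Next, the model compares each word’s query with the key of every word in the sentence, typically through a dot product, to score how much attention each word should pay to the others.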
Attention scores for 'The': [0.9,0.7,0.5, 0.4,0.45,0.56,0.23]
Attention scores for 'cat': [0.6,0.5,0.7, 0.23,0.44,0.58,0.23]
...
Attention scores for '.': [0.3,0.5,0.9, 0.4,0.45,0.56,0.23]
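Scores like the ones listed above are typically obtained by taking the dot product of each query with every key. The sketch below assumes the scaled dot-product form used in standard transformers, where each score is divided by the square root of the key dimension; the 2-dimensional vectors are invented and cover only three tokens.

```python
import math

# Invented 2-dimensional queries and keys for three tokens; real models use
# learned vectors with many more dimensions.
queries = [[0.17, 0.30], [0.59, 0.32], [0.29, 0.62]]
keys = [[0.19, 0.35], [0.75, 0.39], [0.25, 0.79]]

def attention_scores(queries, keys):
    """Return scores[i][j] = (query_i . key_j) / sqrt(d_k)."""
    d_k = len(keys[0])
    scale = math.sqrt(d_k)
    return [
        [sum(q_x * k_x for q_x, k_x in zip(q, k)) / scale for k in keys]
        for q in queries
    ]

scores = attention_scores(queries, keys)
for row in scores:
    print([round(s, 3) for s in row])
```

Each row of `scores` holds one token's scores against every token in the sequence, including itself.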
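Applying softmax
The softmax function then converts these scores into weights between 0 and 1 that sum to 1.
Example: Softmax of attention scores for 'The': [0.20, 0.17, 0.14, 0.12, 0.13, 0.14, 0.10]
Calculation of the weighted sum
Each value vector is multiplied by its attention weight, and the results are added together.
Example: Context-aware representation: [0.20 * Value for 'The' + 0.17 * Value for 'cat' + 0.14 * Value for 'sat' + …]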
The resulting representation captures the contextual meaning of all words, taking into account their associations with other words in the sentence, which improves the model’s predictive capabilities.
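The softmax and weighted-sum steps can be sketched as follows. The scores for 'The' are taken from the example above; the 2-dimensional value vectors are hypothetical and exist only to make the snippet runnable.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, values):
    """Combine value vectors, each scaled by its attention weight."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Attention scores for 'The' from the example above.
scores_the = [0.9, 0.7, 0.5, 0.4, 0.45, 0.56, 0.23]

# Hypothetical 2-dimensional value vectors, one per token.
values = [
    [0.40, 0.50], [0.23, 0.40], [0.23, 0.80], [0.23, 0.45],
    [0.23, 0.90], [0.23, 0.60], [0.23, 0.10],
]

weights = softmax(scores_the)
context_the = weighted_sum(weights, values)
print([round(w, 2) for w in weights])
print([round(x, 3) for x in context_the])
```

Because softmax preserves the ordering of the scores, the token with the highest score always receives the largest weight in the resulting mix.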
Repeating this calculation for every word yields a 2D matrix with one context-aware vector per word.
From these representations, the model computes a probability for each candidate next word and can simply select the most likely one.
This method is known as the “greedy approach.” It tends to lack creativity, because the model consistently opts for the same word.
Alternatively, the language model can pick the next word at random according to these probabilities, which leads to more creative results.
In the sentence “The cat sat on…”, the next word will most likely be “the.”
However, if we choose randomly among other options, we might get something like “bottle” or “plate,” which obviously has a much lower probability.
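The difference between the two strategies can be illustrated with a small sketch. The probabilities in `next_word_probs` are made up for this example, and the function names are hypothetical; only `random.Random` and `random.choices` come from the Python standard library.

```python
import random

# Hypothetical next-word probabilities for "The cat sat on ...".
next_word_probs = {"the": 0.80, "a": 0.15, "bottle": 0.03, "plate": 0.02}

def greedy_choice(probs):
    """Always return the single most likely word."""
    return max(probs, key=probs.get)

def sample_choice(probs, rng):
    """Draw a word at random, in proportion to its probability."""
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
print(greedy_choice(next_word_probs))
print([sample_choice(next_word_probs, rng) for _ in range(10)])
```

Greedy selection returns “the” on every call, while sampling mostly returns “the” but will occasionally produce one of the rarer words.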
To control the level of creativity, we can adjust the temperature parameter, which influences the model’s output.
Temperature is a numerical value (often between 0 and 1, but sometimes higher) used to tune how predictable the model’s output is.
Temperature adjustment: This parameter is incorporated directly into the softmax function, where the model’s raw scores are divided by the temperature before being converted into probabilities.
In short, if we want consistent responses with little creativity, we must decrease the temperature.
On the other hand, if we want fresher and more original responses, we must increase the value of the parameter.
We describe below how different temperature values change the probability distribution of the next word in a sentence:
- Low temperature (below 1.0): A temperature below 1 makes the model produce more predictable and less diverse results. It narrows the model’s choices, often opting for the most likely word, which can make the text seem less creative or varied, or even a little mechanical. This setting is ideal when you want straightforward and less surprising answers.
- High temperature (above 1.0): A temperature greater than 1 introduces more unpredictability into text generation. The model ventures beyond the obvious choices, choosing less likely words, which can make the content more diverse and potentially more creative. But beware: this can also lead to more errors, or even nonsense, as the model strays further from the most probable paths learned from its training data.
- Temperature of 1.0: Often the middle ground, a temperature of 1.0 strikes a balance between the predictable and the unpredictable. In this setting, the model produces text that is neither too monotonous nor too chaotic, reflecting the probability distribution it learned during training.
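The effect of the three settings can be checked numerically. The sketch below assumes the common formulation in which the raw scores (logits) are divided by the temperature before softmax is applied; the words and logits are invented for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores (logits) for four candidate next words.
words = ["the", "a", "bottle", "plate"]
logits = [4.0, 2.5, 1.0, 0.5]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, {w: round(p, 3) for w, p in zip(words, probs)})
```

Lowering the temperature concentrates probability on “the”, while raising it flattens the distribution so that unlikely words such as “plate” are drawn more often.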
Types of large language models
The terminology we use to categorize the different types of large language models is constantly evolving as they become extremely flexible and adaptable.
Here are the three main types of LLMs you’ll often hear about:
- Generic language models: This type of model is trained to predict the next word (also called a “token”) based on the language of the training data.
For example, in the sentence “The cat sat on…”, the most likely next word is “the”.
Think of this type of LLM as a very sophisticated autocomplete function in a search context.
These models calculate the probability of a token or series of tokens appearing in a longer sequence of tokens.
By considering a token as a word, a language model predicts the chances that various words or sequences of words will fill the gap.
- Instruction-tuned: This type of model is trained to predict a response to the instructions given as input.
For example, you might ask it to summarize a text “x”, generate a poem in the style of “x”, or give a list of keywords based on semantic similarity for “x”.
Prompting has a goal similar to fine-tuning: it adapts a model to a particular task, but through few-shot or zero-shot prompts rather than additional training.
A prompt is essentially an instruction provided to an LLM.
Few-shot prompting teaches the model to predict outcomes by presenting examples.
For example, in a sentiment analysis task, a few-shot prompt might appear as follows:
Example of a few-shot prompt for sentiment analysis
- Customer Comment: This plant is so beautiful!
- Customer sentiment: Positive
- Customer Comment: This plant is so hideous!
- Customer sentiment: Negative
The language model, aware of the semantic implication of “hideous” and given the context of an opposing example, would discern that the customer’s sentiment in the second case is “negative”.
On the other hand, a zero-shot prompt does not rely on examples to guide the language model’s response.
Instead, it formulates the prompt as follows: “The sentiment in ‘This plant is so hideous’ is…”, directly signaling the task to be performed without offering examples of how to solve it.
The same approach can also be used to classify text as neutral, negative, or positive in a given context.
- Dialogue-tuned: This type of model is trained to hold a dialogue by predicting the next response.
Dialogue-tuned models are a special case of instruction-tuned models, in which requests are typically formulated as questions to a chatbot.
The dialogue is meant to take place in the context of a longer conversation and typically works best with natural, question-like formulations.
This type of LLM may rely on chain-of-thought reasoning, that is, the observation that models are better at getting the right answer when they first produce text explaining the reasoning behind it.
Examples of chain-of-thought prompts
Prompt | Effect | Example output |
---|---|---|
Q: Roger has 5 tennis balls. He buys 2 more boxes of tennis balls. Each box contains 3 tennis balls. How many tennis balls does he have now? A: | The model is less likely to get the correct answer directly. | |
Q: Roger has 5 tennis balls. He buys 2 more boxes of tennis balls. Each box contains 3 tennis balls. How many tennis balls does he have now? A: Let’s think step by step. | Now the output is more likely to end with the correct answer. | Roger started with 5 balls. 2 boxes of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11. |
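The three prompting styles described in this section (zero-shot, few-shot, and chain-of-thought) differ only in how the input text is assembled. The helper functions below are hypothetical and simply build the prompt strings; sending them to a model is left out, since that depends on the service being used.

```python
def zero_shot_prompt(text):
    """State the task directly, without any solved examples."""
    return f"The sentiment in '{text}' is"

def few_shot_prompt(examples, text):
    """Show solved examples first, then leave the last label blank."""
    lines = []
    for comment, sentiment in examples:
        lines.append(f"Customer comment: {comment}")
        lines.append(f"Customer sentiment: {sentiment}")
    lines.append(f"Customer comment: {text}")
    lines.append("Customer sentiment:")
    return "\n".join(lines)

def chain_of_thought_prompt(question):
    """Append a cue that invites the model to reason before answering."""
    return f"Q: {question}\nA: Let's think step by step."

examples = [("This plant is so beautiful!", "Positive")]
print(zero_shot_prompt("This plant is so hideous"))
print(few_shot_prompt(examples, "This plant is so hideous!"))
print(chain_of_thought_prompt(
    "Roger has 5 tennis balls. He buys 2 more boxes of tennis balls. "
    "Each box contains 3 tennis balls. How many tennis balls does he have now?"
))
```

In the few-shot case the final label is left empty on purpose, so that the model's most natural continuation is the missing sentiment.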
The most popular LLMs today
Large language models have played a pivotal role in the rise of generative AI technology seen in 2023.
Most of these models are built on the transformer architecture, such as the Generative Pre-trained Transformer (GPT) series and Bidirectional Encoder Representations from Transformers (BERT).
After its launch in 2022, ChatGPT (from OpenAI) quickly gained a massive user base, attracting over 100 million users in just two months.
This success has led to the release of many competing models from major companies such as Google and Microsoft, as well as the open source community.
The LLM landscape includes a wide range of influential models, both historical and contemporary, that have paved the way for current leaders or are poised to have a significant impact on the future.
Among these models, some have shaped the direction of today’s AI capabilities, while others, perhaps less recognized, have the potential to advance the next wave of innovation.
Below are some of the most important large language models today, known for their natural language processing capabilities and influence on the design of future models:
- BERT: Introduced by Google in 2018, BERT is a family of LLMs built on transformer technology, capable of converting sequences of data into other sequences of data.
BERT’s architecture consists of a stack of transformer encoders, with about 340 million parameters in its largest version.
Pre-trained on a large dataset, BERT was then fine-tuned for specific tasks, including natural language inference and sentence-level text similarity.
In 2019, Google Search began using BERT to improve its understanding of search queries.
- Gemini: Gemini, Google’s LLM suite, powers and shares its name with the company’s chatbot. The models succeeded PaLM, and the chatbot was rebranded from Bard to Gemini.
Gemini models are multimodal: they can handle not only text, but also images, audio, and video.
They have been integrated into a large number of Google apps and products and come in Ultra, Pro, and Nano variants.
Ultra is the largest and most sophisticated option, Pro serves as a mid-tier model, and Nano is the smallest version, optimized for efficient on-device operation.
- GPT-3.5: GPT-3.5, an improved variant of GPT-3, has a reduced number of parameters and has been refined using reinforcement learning from human feedback (RLHF).
This is the version that powers ChatGPT.
Among its variants, GPT-3.5 Turbo stands out as the most capable, according to OpenAI.
The training data for GPT-3.5 extends until September 2021.
It was also integrated into the Bing search engine, where it has since been replaced by GPT-4.
- GPT-4: Released in 2023, GPT-4 is the largest model in OpenAI’s GPT lineup.
Like its predecessors, it is built on the transformer architecture.
The exact number of its parameters has not been disclosed, although unconfirmed estimates put it at more than a trillion.
OpenAI highlights GPT-4’s multimodal capabilities, which allow it to work with both text and images, beyond the text-only functions of its predecessors.
In addition, GPT-4 introduces a system message feature, which lets users set the tone of voice and the task.
- Llama: In 2023, Meta introduced its large language model, Llama, marking its entry into the LLM arena.
With up to 65 billion parameters in its largest version, Llama was initially available only to a select group of researchers and developers before moving to an open-source model.
Built on the transformer architecture, Llama was trained on a diverse set of public datasets, such as CommonCrawl, GitHub, Wikipedia, and Project Gutenberg.
Following an unintentional leak, Llama spawned a series of derivative models, including Vicuna and Orca, expanding its legacy in the AI space.
- Orca: Microsoft’s Orca, with its 13 billion parameters, is compact enough to run on a laptop.
It seeks to build on the progress made by other open-source models by imitating the reasoning capabilities of much larger language models.
Despite having far fewer parameters, Orca is reported to approach the performance of GPT-4 and to match GPT-3.5 on many tasks.
Orca is built on the 13-billion-parameter version of LLaMA.
- xAI Grok: In mid-March 2024, xAI released Grok-1 as an open-source large language model with 314 billion parameters.
At the time of its release, it was the largest open-source model available, far surpassing previous models such as Falcon 180B, which has 180 billion parameters.
Grok-1 is based on a mixture-of-experts (MoE) architecture, which activates only 25% of its weights for a given token during inference.
Official statements indicate that the released model has not been fine-tuned for specific uses such as chatbots.
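The mixture-of-experts idea mentioned for Grok-1, where only part of the weights is active for each token, can be sketched with a toy router. Everything here is an assumption made for illustration: the choice of 8 experts with 2 selected per token was picked because it matches the 25% figure, the "experts" just scale their input, and the gate scores are invented. It is not a description of Grok-1's actual implementation.

```python
import math

NUM_EXPERTS = 8   # assumption for this sketch
TOP_K = 2         # 2 of 8 experts = 25% of the expert weights per token

def make_expert(scale):
    """Stand-in for an expert network: here it just scales the input."""
    return lambda vector: [scale * x for x in vector]

experts = [make_expert(i + 1) for i in range(NUM_EXPERTS)]

def route(gate_logits, top_k=TOP_K):
    """Pick the top_k experts and softmax-normalize their gate scores."""
    ranked = sorted(
        range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True
    )
    chosen = ranked[:top_k]
    exps = [math.exp(gate_logits[i]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_forward(token_vector, gate_logits):
    """Run only the selected experts and mix their outputs by gate weight."""
    output = [0.0] * len(token_vector)
    for index, weight in route(gate_logits):
        expert_out = experts[index](token_vector)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output

# Hypothetical gate scores produced by a router for one token.
gate_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]
print(route(gate_logits))
print(moe_forward([1.0, 2.0], gate_logits))
```

The six experts that are not selected are never evaluated for this token, which is where the savings at inference time come from.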
Overcoming Challenges
Most of the data used to train state-of-the-art language models is collected from across the internet, for example the Common Crawl dataset, which contains data from over 3 billion web pages.
This massive dataset contains a great deal of personal information about anyone who has some presence online.
This information can be accurate, inaccurate, or even downright false.
This type of scenario raises data protection and privacy issues that are exceptionally difficult to address.
Furthermore, without adequate safeguards, the output generated by large language models could leak sensitive or private data contained in the training datasets, leading to actual or potential data breaches.
The good news is that large language models (LLMs) are not inherently designed in a way that makes them leak private data.
A model does not start leaking private information simply because of the way it is built.
The risk of data breaches has more to do with how the people responsible for the model manage and use it.
On the other hand, large language models can occasionally “hallucinate,” generating false information that appears to be accurate.
These hallucinations can lead to the dissemination of incorrect, absurd, or misleading information about individuals, which could permanently damage a person’s reputation and influence decisions about them.
Furthermore, when LLMs are trained on biased datasets, they risk reinforcing or even exacerbating the biases in that data.
This leads to discriminatory or unfair results, which may violate the standard of fair processing of personal data.
The way forward
As large language models (LLMs) continue to evolve, they are expected to improve across the board.
Future versions will likely produce answers that are not only more consistent, but also feature advanced capabilities for identifying and reducing bias, as well as greater transparency.
These advancements promise to make LLMs a trusted tool across industries such as finance, manufacturing, content generation, healthcare, and education.
The market is expected to grow in the number and variety of available LLMs, giving organizations a wider range of options to determine the LLM that is best suited for their specific AI initiatives.
Customizing these models is also expected to become much easier and more precise, optimizing AI applications for speed, efficiency, and productivity.
In the future, the cost of large language models is expected to decrease significantly, allowing smaller businesses to leverage their benefits and capabilities.