Token Sampling
- Author: Abdul Dakkak
- Date: November 20, 2024
Token sampling (sometimes called weight sampling) in LLM inference is a technique that introduces randomness and diversity into the text generation process. Instead of always selecting the token with the highest probability predicted by the model, sampling allows less probable tokens to be selected in proportion to their assigned probabilities. This leads to more varied and creative text generation, making LLMs less predictable and more capable of producing surprising and engaging outputs.
Terms
Weight/token Sampling
Auto-regressive decoding generates output by predicting the next token based on the previous ones, using a vector of logits that, once normalized (e.g. by a softmax), represents the probability of each token in the vocabulary. Post-processing techniques are applied to these probabilities to control how the next token is chosen, balancing predictability against creativity. This post-processing step is called weight (or token) sampling.
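As a rough sketch of where sampling sits in the decoding loop (assuming a hypothetical `model(ids)` callable that returns the next-token logits for a token sequence), the structure looks like:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert raw logits into a probability distribution."""
    shifted = logits - logits.max()   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def decode(model, prompt_ids: list[int], max_new_tokens: int, eos_id: int) -> list[int]:
    """Auto-regressively extend prompt_ids one token at a time.

    `model(ids)` is a stand-in for a forward pass returning a vector of
    next-token logits (one entry per vocabulary item).
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)
        probs = softmax(logits)
        next_id = int(np.argmax(probs))   # greedy choice; other samplers plug in here
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids
```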
Temperature
In the context of token sampling, temperature is a hyper-parameter that controls the randomness of the sampling process. It reshapes the probability distribution over the possible next tokens.
- High temperature: the sampling becomes more random because the probability distribution is flatter. Less likely tokens have a higher chance of being selected.
- Low temperature: the sampling is more deterministic because the probability distribution is concentrated around the most likely tokens.
By adjusting the temperature (a scaling of the logits inside the softmax function, as sketched below), one can control the trade-off between exploration (sampling less likely tokens) and exploitation (sticking to the most likely tokens). This directly affects the creativity and coherence of the generated text.
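A minimal sketch of temperature scaling, assuming the raw next-token logits are available as a NumPy array:

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Divide the logits by the temperature before applying the softmax.

    temperature > 1 flattens the distribution (more random);
    temperature < 1 sharpens it (more deterministic).
    """
    scaled = logits / max(temperature, 1e-8)   # guard against division by zero
    scaled -= scaled.max()                     # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])
print(softmax_with_temperature(logits, temperature=0.5))  # peaked around the top logit
print(softmax_with_temperature(logits, temperature=2.0))  # noticeably flatter
```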
Algorithms
Sampling
Randomly selects a token based on the token probability distribution. This method introduces variety but can lead to nonsensical outputs if not constrained.
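A minimal sketch of unconstrained sampling, assuming the post-softmax probabilities are already available:

```python
import numpy as np

def sample_token(probs: np.ndarray, rng: np.random.Generator | None = None) -> int:
    """Draw one token index from the full probability distribution."""
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(probs), p=probs))

probs = np.array([0.5, 0.3, 0.15, 0.05])
print(sample_token(probs))  # any of 0..3, weighted by probs
```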
Beam search
Maintains the k most probable partial sequences and expands them at each step, considering multiple hypotheses simultaneously. By keeping track of several candidate continuations at every generation step, it reduces the chance of discarding a high-probability sequence too early. This method is deterministic and can find sequences with probability greater than or equal to that found by greedy search, but it requires running inference for every hypothesis at each step.
While beam search is a powerful algorithm for certain NLP tasks, its limitations in terms of diversity and exploration make it less ideal for sampling in LLMs, especially when more diverse or creative outputs are desired.
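A simplified sketch of beam search, assuming the same hypothetical `model(ids)` interface as in the earlier decoding sketch and scoring hypotheses by their summed log-probability:

```python
import numpy as np

def beam_search(model, prompt_ids: list[int], beam_width: int, steps: int) -> list[int]:
    """Keep the `beam_width` highest-scoring partial sequences at every step."""
    beams = [(list(prompt_ids), 0.0)]   # each beam is (token_ids, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for ids, score in beams:
            logits = model(ids)
            shifted = logits - logits.max()
            log_probs = shifted - np.log(np.exp(shifted).sum())   # log-softmax
            # Expand each beam with its `beam_width` best next tokens.
            for tok in np.argsort(log_probs)[-beam_width:]:
                candidates.append((ids + [int(tok)], score + float(log_probs[tok])))
        # Keep only the best `beam_width` hypotheses overall.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]   # highest-scoring sequence
```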
Greedy sampling
Greedy weight sampling selects the token with the highest probability at each step. This approach is deterministic, meaning it always chooses the most likely next token based on the model's predictions.
Given a probability distribution \(P=\lbrace p_1, p_2, \ldots, p_n \rbrace\) over the vocabulary, where \(p_i\) is the probability of the \(i^{th}\) word, the greedy sampling value is computed by
- Sort P in descending order: \(P_{sorted}=\lbrace p_1^′,p_2^′,...,p_n^′ \rbrace\)
- Select the top-1 word: \(p_1^′\), i.e. the word \(\texttt{arg max}_i \; p_i\)
Greedy sampling differs from stochastic sampling methods like top-k and top-p sampling, which introduce randomness into the sampling process to generate more diverse outputs. While greedy sampling is efficient and predictable, it may not always produce the most natural or creative text.
Implementation
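A minimal sketch, assuming the next-token probabilities are already available as a NumPy array:

```python
import numpy as np

def greedy_sample(probs: np.ndarray) -> int:
    """Return the index of the most probable token."""
    return int(np.argmax(probs))

probs = np.array([0.1, 0.6, 0.2, 0.1])
print(greedy_sample(probs))  # 1
```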
top-k sampling
Selects the next word from the top k most probable words, allowing for more diversity compared to greedy sampling. When k=1, top-k sampling is identical to greedy sampling.
Given a probability distribution \(P=\lbrace p_1, p_2, \ldots, p_n \rbrace\) over the vocabulary, where \(p_i\) is the probability of the \(i^{th}\) word, the top-k value is computed by
- Sort P in descending order: \(P_{sorted}=\lbrace p_1^′,p_2^′,...,p_n^′ \rbrace\)
- Select the top K words: \(P_{topk}=\lbrace p_1^′,p_2^′,...,p_K^′ \rbrace\)
- Normalize the probabilities of \(P_{topk}\)
- Sample the next word from the normalized probabilities
Implementation
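A minimal sketch of top-k sampling, assuming the next-token probabilities are already available; with `k=1` it reduces to the greedy sampler above:

```python
import numpy as np

def top_k_sample(probs: np.ndarray, k: int, rng: np.random.Generator | None = None) -> int:
    """Sample from the k most probable tokens after renormalizing."""
    rng = rng or np.random.default_rng()
    top_indices = np.argsort(probs)[-k:]        # indices of the k largest probabilities
    top_probs = probs[top_indices]
    top_probs = top_probs / top_probs.sum()     # renormalize so the kept probabilities sum to 1
    return int(rng.choice(top_indices, p=top_probs))

probs = np.array([0.05, 0.5, 0.3, 0.1, 0.05])
print(top_k_sample(probs, k=2))   # returns index 1 or 2
```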
top-p sampling
top-p sampling, also known as nucleus sampling, is a technique used in LLMs to control the randomness and quality of generated text. It is designed to balance creativity and coherence in AI-generated content by dynamically adjusting the sampling process based on the probability distribution of the next word in a sequence.
Given a sorted list of word probabilities \(P_{sorted}=\lbrace p_1^′, p_2^′, \ldots, p_n^′ \rbrace\), top-p sampling selects the smallest set of words whose cumulative probability exceeds the threshold \(p\), i.e. the smallest \(k\) such that \(\sum_{i=1}^{k} p_i^′ \ge p\), where \(k\) is the number of words in the set. The next word is then sampled from this set after renormalizing its probabilities.
top-p sampling is more dynamic than Top-K sampling, which always considers a fixed number of options. Unlike temperature sampling, which can sometimes produce odd results, top-p sampling adjusts based on the probability distribution, making it a versatile and effective method for controlling the quality and diversity of generated text.
Implementation
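A minimal sketch of top-p (nucleus) sampling, assuming the next-token probabilities are already available:

```python
import numpy as np

def top_p_sample(probs: np.ndarray, p: float, rng: np.random.Generator | None = None) -> int:
    """Sample from the smallest set of tokens whose cumulative probability reaches p."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]                    # indices sorted by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # size of the smallest prefix reaching p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()   # renormalize the nucleus
    return int(rng.choice(nucleus, p=nucleus_probs))

probs = np.array([0.45, 0.25, 0.15, 0.1, 0.05])
print(top_p_sample(probs, p=0.7))   # samples from the first two tokens (0.45 + 0.25 = 0.70)
```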
min-p sampling
The min-p sampling method, conceived as an alternative to top-p sampling, strives to balance the coherence and diversity of generated text. The parameter \(p\) establishes a dynamic probability threshold for token selection: the threshold is computed as the product of \(p\) and the probability assigned to the most probable token. Tokens whose probabilities fall below this threshold are excluded from consideration.
Given a probability distribution \(P=\lbrace p_1, p_2, \ldots, p_n \rbrace\) over the vocabulary, where \(p_i\) is the probability of the \(i^{th}\) word, the min-p value is computed by
- Find the maximum probability value \(p_{max} = \texttt{max}(P)\)
- Calculate the probability threshold: \(\texttt{threshold} = p \times p_{max}\), where \(p\) is the min-p parameter
- Filter the tokens based on the threshold: \(\lbrace i | p_i \ge \texttt{threshold} \rbrace\)
- Normalize the filtered probabilities
- Sample from the normalized filtered tokens
min-p sampling differs from top-k and top-p sampling in its dynamic nature and its ability to adapt to the probability distribution of tokens. Unlike top-k, which selects a fixed number of tokens, and top-p, which uses a cumulative probability threshold, min-p sampling uses a probability threshold relative to the most likely token to filter candidates, making it more flexible and effective in balancing creativity and coherence.
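Implementation
A minimal sketch of min-p filtering, assuming the next-token probabilities are already available and `p_base` is the min-p parameter:

```python
import numpy as np

def min_p_sample(probs: np.ndarray, p_base: float, rng: np.random.Generator | None = None) -> int:
    """Keep tokens whose probability is at least p_base * max(probs), then sample."""
    rng = rng or np.random.default_rng()
    threshold = p_base * probs.max()                # threshold scales with the top token's probability
    keep = np.where(probs >= threshold)[0]          # tokens that survive the filter
    kept_probs = probs[keep] / probs[keep].sum()    # renormalize
    return int(rng.choice(keep, p=kept_probs))

probs = np.array([0.6, 0.25, 0.1, 0.05])
print(min_p_sample(probs, p_base=0.2))   # threshold is 0.12, so only tokens 0 and 1 remain
```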
Comparisons
| Method | Description | Pros | Cons |
|---|---|---|---|
| Greedy Sampling | Selects the highest probability token at each step. | Fast and deterministic | Can lack diversity |
| Beam Search | Explores multiple sequences simultaneously, keeping track of top candidates. | Balances exploration and quality | More computationally intensive |
| Stochastic Sampling | Introduces randomness by sampling from the probability distribution. | Produces diverse outputs | Less predictable |

Effects of top-p, top-k, and min-p sampling on token probability distributions. min-p sampling dynamically adjusts its filtering threshold based on the model's confidence, focusing on high-probability tokens when the model is confident and including diverse but plausible options when it is uncertain. This helps min-p balance coherence and diversity more effectively than top-p and top-k sampling. Source: Minh et al., Figure 1.
References
- The Curious Case of Neural Text Degeneration
- GPT Runtime - TensorRT-LLM Documentation
- Controlling Generation with Top-k and Top-p - Cohere Documentation
- Token Sampling Primer - Aman AI
- Your settings are probably hurting your model - why? - Reddit
- Sampling - LabML
- Token Sampling Concepts - Vinija AI
- Sampling - Huyenchip's Blog
- llama.cpp Examples README
- A Survey of Large Language Model Inference: From Algorithm to System
- Customizing LLM Output: Post-Processing Techniques - Neptune AI
- The Unreasonable Effectiveness of Transformer Language Models in Grammatical Error Correction
- Min-P Sampling: A Simple and Effective Alternative to Top-P Sampling
- Explained Completions - Interactive Token Sampling Visualization
- Token Sampling Interactive Demo - Google Colab