High-level, type-flavored view of how data is transformed through an autoregressive (next-token) transformer-based LLM.
- Tokenizer
string -> vec[int] (tokens)
- Truncate/pad to sequence length
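The tokenizer stage can be sketched with a toy character-level scheme (real tokenizers use learned subword vocabularies; the per-character `ord` mapping and `pad_id` here are purely illustrative):

```python
# Toy tokenizer sketch: string -> vec[int], truncated/padded to seq_len.
# One token per character via ord(); real models use subword vocabularies.
def tokenize(text: str, seq_len: int, pad_id: int = 0) -> list[int]:
    ids = [ord(ch) for ch in text]
    ids = ids[:seq_len]                      # truncate if too long
    ids += [pad_id] * (seq_len - len(ids))   # right-pad if too short
    return ids

ids = tokenize("hi", seq_len=4)   # -> [104, 105, 0, 0]
```

Truncation and padding give every example the same fixed length, which is what lets batches stack into rectangular tensors downstream.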
- Embedding layer
vec[int] -> vec[vec[float]]
- Turns tokens into a vector of embedding vectors
- Embedding vectors are also adjusted by positional encoding
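A minimal NumPy sketch of this stage, assuming a small made-up vocab size and embedding dimension, with a random table standing in for learned weights and a random per-position table standing in for positional encoding:

```python
import numpy as np

# Embedding sketch: vec[int] -> vec[vec[float]].
# Sizes are illustrative; real models use far larger vocab_size and d_model.
vocab_size, seq_len, d_model = 256, 4, 8
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((vocab_size, d_model))  # learned in practice
positional = rng.standard_normal((seq_len, d_model))          # learned or sinusoidal

def embed(token_ids):
    x = embedding_table[token_ids]           # look up one vector per token
    return x + positional[: len(token_ids)]  # inject position information

x = embed([104, 105, 0, 0])   # shape (4, 8): one d_model vector per position
```

The lookup is pure indexing; the positional addition is what breaks the symmetry between otherwise identical tokens at different positions.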
- Many transformer blocks
vec[vec[float]] -> vec[vec[float]]
- Hidden states/activations flow through transformer blocks
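One block as a shape-preserving map can be sketched as single-head self-attention plus an MLP, both with residual connections. This is a data-flow sketch only: random weights, no layer norm, no causal mask, no multi-head split.

```python
import numpy as np

# One transformer block: (seq_len, d_model) -> (seq_len, d_model).
d_model = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
W1 = rng.standard_normal((d_model, 4 * d_model)) * 0.1
W2 = rng.standard_normal((4 * d_model, d_model)) * 0.1

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def block(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_model)) @ v   # self-attention mixes positions
    x = x + attn @ Wo                                # residual connection
    return x + np.maximum(x @ W1, 0) @ W2            # ReLU MLP, residual

x = rng.standard_normal((4, d_model))
y = block(x)   # same shape in and out, so blocks stack
```

Because the input and output shapes match, dozens of these blocks can be composed one after another.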
- LM head
vec[vec[float]] -> vec[vec[float]]
- Turns the final activations into a set of vocab scores (logits) for each position
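The LM head is just a linear projection from the hidden dimension to the vocabulary size; a sketch with illustrative sizes and a random weight matrix:

```python
import numpy as np

# LM head sketch: project each position's hidden state to vocab-size logits.
d_model, vocab_size = 8, 256
rng = np.random.default_rng(0)
W_head = rng.standard_normal((d_model, vocab_size)) * 0.1

hidden = rng.standard_normal((4, d_model))   # (seq_len, d_model) activations
logits = hidden @ W_head                     # (seq_len, vocab_size) scores
```

Note the output's second dimension is now vocab_size, not d_model: each position holds one unnormalized score per vocabulary entry.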
- Final softmax
vec[vec[float]] -> vec[vec[float]]
- Each position gets a probability distribution over the vocabulary (sometimes you decode directly from logits)
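The softmax is applied row-wise, so each position's logits independently become a distribution that sums to 1:

```python
import numpy as np

# Row-wise softmax: each position's logits -> a probability distribution.
def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.0],
                   [0.0, 0.0, 0.0]])
probs = softmax(logits)   # each row sums to 1; equal logits give a uniform row
```

Subtracting the per-row max before exponentiating avoids overflow without changing the result.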
- Output usage (depends on training vs. inference)
- Training:
vec[vec[float]] -> float
- The logits and next-token targets are used to compute cross-entropy loss
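A toy cross-entropy computation under teacher forcing, where the logits at position t are scored against the token that actually appears at position t+1 (the numbers and 4-token vocab are made up):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Logits for two positions over a 4-token vocab; each row is peaked at its target.
logits = np.array([[4.0, 0.0, 0.0, 0.0],
                   [0.0, 4.0, 0.0, 0.0]])
targets = np.array([0, 1])   # the "next token" each position should predict

probs = softmax(logits)
# Cross-entropy: mean negative log-probability assigned to the correct token.
loss = -np.mean(np.log(probs[np.arange(len(targets)), targets]))
```

Because the targets are just the input sequence shifted by one, a single forward pass yields a training signal at every position at once.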
- Inference:
vec[vec[float]][-1] -> logits (last position) -> decoding -> next token id (int)
- Append the new token to the sequence and repeat; detokenize via the tokenizer
- Training usually computes logits for every position and compares to the next token (teacher forcing).
- Generation usually uses only last-position logits each step.
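The generation loop above can be sketched with greedy decoding. The `model` function here is a hypothetical stand-in (it just returns logits peaked at each position's own token id) so the loop's mechanics are visible without a real network:

```python
# Greedy decoding sketch: take last-position logits, argmax, append, repeat.
def model(token_ids):
    # Hypothetical stand-in for the full forward pass: per-position logits
    # over a 4-token vocab, peaked at each position's own token id.
    return [[4.0 if v == t else 0.0 for v in range(4)] for t in token_ids]

def generate(prompt_ids, n_steps):
    ids = list(prompt_ids)
    for _ in range(n_steps):
        logits = model(ids)[-1]     # only the last position's logits are used
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
        ids.append(next_id)         # feed the new token back in
    return ids

out = generate([2], 3)   # the echoing stand-in keeps repeating token 2
```

Real decoders swap the argmax for sampling with temperature, top-k, or top-p, but the feed-the-output-back-in loop is the same.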