A high-level, type-flavored view of how data flows through an autoregressive (next-token) transformer-based LLM.

  1. Tokenizer
    1. string -> vec[int] (tokens)
    2. Truncate/pad to sequence length
  2. Embedding layer
    1. vec[int] -> vec[vec[float]]
    2. Turns tokens into a vector of embedding vectors
    3. Embedding vectors are also adjusted by positional encoding
  3. Many transformer blocks
    1. vec[vec[float]] -> vec[vec[float]]
    2. Hidden states/activations flow through transformer blocks
  4. LM head
    1. vec[vec[float]] -> vec[vec[float]]
    2. Turns the final activations into a set of vocab scores (logits) for each position
  5. Final softmax
    1. vec[vec[float]] -> vec[vec[float]]
    2. Each position gets a probability distribution across the vocabulary (sometimes you decode directly from logits)
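Each step above can be sketched in code. For step 1 (tokenizer), a toy character-level version, assuming an invented vocabulary and pad id (real tokenizers use subword schemes like BPE, but the string -> vec[int] shape is the same):

```python
# Toy tokenizer sketch: hypothetical character vocab and pad id.
VOCAB = {ch: i for i, ch in enumerate("abcdefgh ")}
PAD_ID = len(VOCAB)  # assumed pad token id

def tokenize(text: str, seq_len: int) -> list[int]:
    ids = [VOCAB[ch] for ch in text]        # string -> vec[int]
    ids = ids[:seq_len]                     # truncate to sequence length
    ids += [PAD_ID] * (seq_len - len(ids))  # pad to sequence length
    return ids

print(tokenize("bad cab", 10))  # -> [1, 0, 3, 8, 2, 0, 1, 9, 9, 9]
```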
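For step 2 (embedding layer), a NumPy sketch with a random (untrained) embedding table and sinusoidal positional encoding, which is one common choice for adjusting embeddings by position:

```python
import numpy as np

# Embedding sketch: table lookup plus sinusoidal positional encoding.
# The table is random, so this shows shapes, not learned behavior.
rng = np.random.default_rng(0)
vocab_size, d_model, seq_len = 16, 8, 5
E = rng.normal(size=(vocab_size, d_model))  # embedding table

def embed(token_ids):
    x = E[token_ids]                         # vec[int] -> vec[vec[float]]
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / (10000 ** (2 * (i // 2) / d_model))
    pe = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
    return x + pe                            # position adjusts each embedding

h = embed([3, 1, 4, 1, 5])
print(h.shape)  # (seq_len, d_model) = (5, 8)
```

Note that the two occurrences of token 1 get different vectors: same table row, different positional offsets.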
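For step 3 (transformer blocks), a minimal single-head block in NumPy: causal self-attention followed by a two-layer MLP, each with a residual connection. Weights are random placeholders and layer norm is omitted for brevity; the point is that the vec[vec[float]] -> vec[vec[float]] shape is preserved, so blocks can be stacked:

```python
import numpy as np

# Minimal transformer block sketch (single head, no layer norm).
rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d), scale=0.1) for _ in range(4))
W1 = rng.normal(size=(d, 4 * d), scale=0.1)
W2 = rng.normal(size=(4 * d, d), scale=0.1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def block(x):                                # (seq, d) -> (seq, d)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.full(scores.shape, -np.inf), k=1)  # causal: no lookahead
    x = x + softmax(scores + mask) @ v @ Wo              # attention + residual
    x = x + np.maximum(x @ W1, 0) @ W2                   # ReLU MLP + residual
    return x

x = rng.normal(size=(5, d))
print(block(x).shape)  # shape preserved: (5, 8)
```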
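For step 4 (LM head), the transform is just a linear projection from hidden size to vocab size, applied independently at every position (random weights again, so these logits are meaningless except for their shape):

```python
import numpy as np

# LM head sketch: project each position's hidden state to vocab scores.
rng = np.random.default_rng(0)
d_model, vocab_size = 8, 16
W_lm = rng.normal(size=(d_model, vocab_size))

h = rng.normal(size=(5, d_model))  # final hidden states, one per position
logits = h @ W_lm                  # (seq, d_model) -> (seq, vocab_size)
print(logits.shape)                # (5, 16)
```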
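For step 5 (final softmax), a row-wise softmax: each position's logits become a probability distribution over the vocabulary, so every row sums to 1:

```python
import numpy as np

# Softmax sketch, applied row-wise over the vocab axis.
def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

probs = softmax(np.array([[2.0, 1.0, 0.0],
                          [0.0, 0.0, 0.0]]))
print(probs.sum(axis=-1))  # each row sums to 1.0
```

A row of identical logits (the second row) becomes a uniform distribution, which is one quick sanity check.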

Output usage (depends on training vs inference)

  1. Training: vec[vec[float]] -> float
    1. The logits and next-token targets are used to calculate cross-entropy loss
  2. Inference: vec[vec[float]][-1] -> int
    1. Take the logits at the last position and decode them (e.g. greedy or sampling) into the next token id (int)
    2. Append that id to the sequence and repeat; detokenize via the tokenizer
  • Training usually computes logits for every position and compares to the next token (teacher forcing).
  • Generation usually uses only last-position logits each step.
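The training path can be sketched as teacher-forcing cross-entropy: position t's logits are scored against the token at position t+1 (the logits here are made up, not from a model):

```python
import numpy as np

# Cross-entropy sketch with the shift-by-one used in teacher forcing.
def cross_entropy(logits, token_ids):
    inputs = logits[:-1]    # positions that have a "next token" to predict
    targets = token_ids[1:] # targets shifted by one
    z = inputs - inputs.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()  # scalar loss

logits = np.log(np.full((4, 3), 1.0 / 3.0))  # uniform over a 3-token vocab
loss = cross_entropy(logits, np.array([0, 1, 2, 0]))
print(loss)  # uniform predictions -> loss = ln(3) ~= 1.0986
```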
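The inference path, as a greedy decoding loop: take the last position's logits, pick a token, append, repeat. The `model` here is a hypothetical stand-in that returns fixed logits, just to make the loop runnable:

```python
import numpy as np

# Greedy decoding loop sketch; `model` maps a token list to (seq, vocab) logits.
def generate(model, token_ids, steps):
    for _ in range(steps):
        logits = model(token_ids)             # (seq, vocab)
        next_id = int(np.argmax(logits[-1]))  # last-position logits only
        token_ids = token_ids + [next_id]     # append and repeat
    return token_ids

# Stand-in model: always prefers token (current length % 3).
fake_model = lambda ids: np.eye(3)[[len(ids) % 3] * len(ids)]
print(generate(fake_model, [0], 3))  # -> [0, 1, 2, 0]
```

Sampling-based decoding would replace the argmax with a draw from the softmaxed logits; the loop structure is the same.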