A high-level, type-flavored view of how data flows through an autoregressive (next-token) transformer-based LLM.

  1. Tokenizer
    1. string -> vec[int] (tokens)
    2. Truncate/pad to sequence length
  2. Embedding layer
    1. vec[int] -> vec[vec[float]]
    2. Turns tokens into a vector of embedding vectors
    3. Embedding vectors are also adjusted by positional encoding
  3. Many transformer blocks
    1. vec[vec[float]] -> vec[vec[float]]
    2. Hidden states/activations flow through transformer blocks
  4. LM head
    1. vec[vec[float]] -> vec[vec[float]]
    2. Turns the final activations into a set of vocab scores (logits) for each position
  5. Final softmax
    1. vec[vec[float]] -> vec[vec[float]]
    2. Each position gets a probability distribution across the vocabulary (sometimes you decode directly from logits)
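Each step above can be sketched in code. For step 1 (tokenizer), a toy character-level version, assuming an invented vocabulary and pad id (real tokenizers use subword schemes like BPE, but the string -> vec[int] shape is the same):

```python
# Toy tokenizer sketch: hypothetical character vocab and pad id.
VOCAB = {ch: i for i, ch in enumerate("abcdefgh ")}
PAD_ID = len(VOCAB)  # assumed pad token id

def tokenize(text: str, seq_len: int) -> list[int]:
    ids = [VOCAB[ch] for ch in text]        # string -> vec[int]
    ids = ids[:seq_len]                     # truncate to sequence length
    ids += [PAD_ID] * (seq_len - len(ids))  # pad to sequence length
    return ids

print(tokenize("bad cab", 10))  # -> [1, 0, 3, 8, 2, 0, 1, 9, 9, 9]
```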
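For step 2 (embedding layer), a NumPy sketch with a random (untrained) embedding table and sinusoidal positional encoding, which is one common choice for adjusting embeddings by position:

```python
import numpy as np

# Embedding sketch: table lookup plus sinusoidal positional encoding.
# The table is random, so this shows shapes, not learned behavior.
rng = np.random.default_rng(0)
vocab_size, d_model, seq_len = 16, 8, 5
E = rng.normal(size=(vocab_size, d_model))  # embedding table

def embed(token_ids):
    x = E[token_ids]                         # vec[int] -> vec[vec[float]]
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / (10000 ** (2 * (i // 2) / d_model))
    pe = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
    return x + pe                            # position adjusts each embedding

h = embed([3, 1, 4, 1, 5])
print(h.shape)  # (seq_len, d_model) = (5, 8)
```

Note that the two occurrences of token 1 get different vectors: same table row, different positional offsets.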
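For step 3 (transformer blocks), a minimal single-head block in NumPy: causal self-attention followed by a two-layer MLP, each with a residual connection. Weights are random placeholders and layer norm is omitted for brevity; the point is that the vec[vec[float]] -> vec[vec[float]] shape is preserved, so blocks can be stacked:

```python
import numpy as np

# Minimal transformer block sketch (single head, no layer norm).
rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d), scale=0.1) for _ in range(4))
W1 = rng.normal(size=(d, 4 * d), scale=0.1)
W2 = rng.normal(size=(4 * d, d), scale=0.1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def block(x):                                # (seq, d) -> (seq, d)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.full(scores.shape, -np.inf), k=1)  # causal: no lookahead
    x = x + softmax(scores + mask) @ v @ Wo              # attention + residual
    x = x + np.maximum(x @ W1, 0) @ W2                   # ReLU MLP + residual
    return x

x = rng.normal(size=(5, d))
print(block(x).shape)  # shape preserved: (5, 8)
```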
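For step 4 (LM head), the transform is just a linear projection from hidden size to vocab size, applied independently at every position (random weights again, so these logits are meaningless except for their shape):

```python
import numpy as np

# LM head sketch: project each position's hidden state to vocab scores.
rng = np.random.default_rng(0)
d_model, vocab_size = 8, 16
W_lm = rng.normal(size=(d_model, vocab_size))

h = rng.normal(size=(5, d_model))  # final hidden states, one per position
logits = h @ W_lm                  # (seq, d_model) -> (seq, vocab_size)
print(logits.shape)                # (5, 16)
```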
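For step 5 (final softmax), a row-wise softmax: each position's logits become a probability distribution over the vocabulary, so every row sums to 1:

```python
import numpy as np

# Softmax sketch, applied row-wise over the vocab axis.
def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

probs = softmax(np.array([[2.0, 1.0, 0.0],
                          [0.0, 0.0, 0.0]]))
print(probs.sum(axis=-1))  # each row sums to 1.0
```

A row of identical logits (the second row) becomes a uniform distribution, which is one quick sanity check.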

Output usage (depends on training vs inference)

  1. Training: vec[vec[float]] -> float
    1. The logits and next-token targets are used to calculate cross-entropy loss
  2. Inference: vec[vec[float]][-1] -> int
    1. Take the logits at the last position and decode them (e.g. greedy or sampling) into the next token id (int)
    2. Append that id to the sequence and repeat; detokenize via the tokenizer
  • Training usually computes logits for every position and compares to the next token (teacher forcing).
  • Generation usually uses only last-position logits each step.
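The training path can be sketched as teacher-forcing cross-entropy: position t's logits are scored against the token at position t+1 (the logits here are made up, not from a model):

```python
import numpy as np

# Cross-entropy sketch with the shift-by-one used in teacher forcing.
def cross_entropy(logits, token_ids):
    inputs = logits[:-1]    # positions that have a "next token" to predict
    targets = token_ids[1:] # targets shifted by one
    z = inputs - inputs.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()  # scalar loss

logits = np.log(np.full((4, 3), 1.0 / 3.0))  # uniform over a 3-token vocab
loss = cross_entropy(logits, np.array([0, 1, 2, 0]))
print(loss)  # uniform predictions -> loss = ln(3) ~= 1.0986
```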
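The inference path, as a greedy decoding loop: take the last position's logits, pick a token, append, repeat. The `model` here is a hypothetical stand-in that returns fixed logits, just to make the loop runnable:

```python
import numpy as np

# Greedy decoding loop sketch; `model` maps a token list to (seq, vocab) logits.
def generate(model, token_ids, steps):
    for _ in range(steps):
        logits = model(token_ids)             # (seq, vocab)
        next_id = int(np.argmax(logits[-1]))  # last-position logits only
        token_ids = token_ids + [next_id]     # append and repeat
    return token_ids

# Stand-in model: always prefers token (current length % 3).
fake_model = lambda ids: np.eye(3)[[len(ids) % 3] * len(ids)]
print(generate(fake_model, [0], 3))  # -> [0, 1, 2, 0]
```

Sampling-based decoding would replace the argmax with a draw from the softmaxed logits; the loop structure is the same.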