- A history of Nvidia Stream Multiprocessor
- Host overhead is killing your inference efficiency
- How to Think About GPUs
- How to Think About TPUs
- Making Deep Learning Go Brrrr From First Principles
- The Ultra-Scale Playbook: Training LLMs on GPU Clusters
- TPU Deep Dive
- The Illustrated Transformer
- ML Engineering Open Book
- CUDA for ML - Intuitively and Exhaustively Explained
- Demystifying GPU Compute Architectures
- Inside NVIDIA GPUs: Anatomy of high performance matmul kernels
- Arm at Amazon
- The Computer Architecture of AI (in 2024)
- What Happened with FPGA Acceleration?
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
- Getting Started with Intermittent Computing
- Intermittent Computing: Challenges and Opportunities
- GPU-Puzzles
- Making GPUs Actually Fast: A Deep Dive into Training Performance
- 16 charts that explain the AI boom
- Open Sustainable Technology
- Is Parallel Programming Hard, And, If So, What Can You Do About It?
- When a barrier does not block: The pitfalls of partial order
- High Bandwidth Flash: NAND’s Bid for AI Memory
- No Graphics API
- What I have been reading: What is a ml compiler
- Why are ML Compilers so Hard?
- How LLMs are trained for function calling
- Understanding Reasoning LLMs
- A Decade of Residuals: History & Effects on modern ML
- A Gentle Introduction to Distributed Training
- FPGA Internal Arch
- https://frankdenneman.ai/ai-infrastructure/
- https://tinyblog-phi.vercel.app/tinygrad