Module 1: What tinygrad does — and what happens when you use it

You're training a neural network on MNIST

Imagine you're building an app that recognizes handwritten digits. You want to train a model using tinygrad, a minimalist deep learning framework. You start with code that looks like this:

from tinygrad import Tensor, nn

class Model:
    def __init__(self):
        self.layers = [
            nn.Conv2d(1, 32, 5), Tensor.relu,
            nn.Conv2d(32, 32, 5), Tensor.relu,
            nn.BatchNorm(32), Tensor.max_pool2d,
            nn.Conv2d(32, 64, 3), Tensor.relu,
            nn.Conv2d(64, 64, 3), Tensor.relu,
            nn.BatchNorm(64), Tensor.max_pool2d,
            lambda x: x.flatten(1), nn.Linear(576, 10)
        ]

    def __call__(self, x: Tensor) -> Tensor:
        return x.sequential(self.layers)

model = Model()
opt = nn.optim.Adam(nn.state.get_parameters(model), lr=0.001)

with Tensor.train():
    for step in range(10):
        # images, labels: a batch of 28x28 MNIST digits and their classes
        # Forward pass
        logits = model(images)
        loss = logits.sparse_categorical_crossentropy(labels)

        # Backward pass
        opt.zero_grad()
        loss.backward()
        opt.step()

Plain English translation:

  • You define a model as a list of layers, each transforming a Tensor
  • You create an optimizer to adjust the model's parameters
  • For each training step: you pass data through the model (forward), compute the loss, then update the parameters (backward + step)

Quick Check: Which operations compute gradients?

What happens under the hood?

When you run this code, tinygrad doesn't immediately execute operations. Instead, it builds a computation graph in memory, then optimizes and executes it later. Here's the journey:

Python Code → Build Graph (UOps) → Optimize & Fuse → Compile Kernels → Execute on GPU/CPU

# This line doesn't compute yet!
logits = model(images)

Behind the scenes:

  1. Each Tensor operation (conv2d, relu, etc.) creates one or more UOp (micro-operation) nodes
  2. These UOps are linked together forming a directed acyclic graph (DAG)
  3. No actual computation happens until you call .backward() or .realize()

This is called lazy evaluation — tinygrad waits until it sees the whole computation before running anything. This lets it optimize globally.
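The idea can be sketched in a few lines of plain Python with a toy graph node (an illustration only, not tinygrad's actual UOp machinery):

```python
# Toy sketch of lazy evaluation: operations record a node in a graph
# instead of computing immediately.
class LazyOp:
    def __init__(self, fn, *srcs):
        self.fn, self.srcs = fn, srcs   # remember the op and its inputs
        self.result = None              # nothing computed yet

    def realize(self):
        # Compute only when asked, recursively realizing inputs first
        if self.result is None:
            self.result = self.fn(*[s.realize() for s in self.srcs])
        return self.result

class Const(LazyOp):
    def __init__(self, v): self.result = v
    def realize(self): return self.result

# Building the graph runs no arithmetic...
graph = LazyOp(lambda a, b: a + b,
               Const(2),
               LazyOp(lambda a, b: a * b, Const(3), Const(4)))
# ...until realize() walks the DAG
print(graph.realize())  # 2 + 3*4 = 14
```

Because the whole graph exists before `realize()` runs, an optimizer could inspect and rewrite it first, which is exactly the opening tinygrad exploits.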

💡 Aha! Lazy Evaluation is Key

If tinygrad executed each operation immediately, it couldn't fuse operations or reorder them. By building the full graph first, tinygrad can see the entire computation and optimize it as a whole.
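A minimal illustration of why fusion needs the whole graph (conceptual sketch, not tinygrad's scheduler):

```python
# Two elementwise ops executed eagerly need two passes over the data
# and an intermediate buffer; seeing both ops at once allows one pass.
def unfused(xs):
    tmp = [x + 1 for x in xs]         # pass 1: writes an intermediate buffer
    return [t * 2 for t in tmp]       # pass 2: reads it back

def fused(xs):
    return [(x + 1) * 2 for x in xs]  # one pass, no intermediate buffer

data = [0, 1, 2, 3]
assert unfused(data) == fused(data) == [2, 4, 6, 8]
```

On a GPU the saved intermediate buffer is a round trip to memory, which is usually the dominant cost of elementwise ops.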

Realizing the computation

Calling loss.backward() builds the gradient graph; the kernels actually run when the results are needed, e.g. when opt.step() realizes the updated parameters:

# Simplified sketch of what loss.backward() does:
def backward(self):
    # 1. Build the gradient graph by applying the chain rule to each UOp
    # 2. Attach a lazy .grad tensor to each parameter
    # 3. Execution is still deferred until the grads are realized
    ...
Backward pass → Chain Rule → Gradients → Kernels
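Before looking at tinygrad's implementation, here is the chain rule itself checked numerically in ordinary Python:

```python
# Chain rule worked by hand for f(x) = (w*x + b)**2:
# with u = w*x + b, df/dx = df/du * du/dx = 2*u * w.
w, b, x = 3.0, 1.0, 2.0
u = w * x + b
analytic = 2 * u * w          # chain rule result

# Finite-difference check of the same derivative
eps = 1e-6
f = lambda x: (w * x + b) ** 2
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
assert abs(analytic - numeric) < 1e-3
print(analytic)  # 42.0
```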

The backward pass uses the chain rule to compute gradients. tinygrad's autograd system has a pm_gradient pattern matcher that knows how to differentiate each operation type:

# From tinygrad/gradient.py (abridged)
pm_gradient = PatternMatcher([
    (UPat(Ops.ADD), lambda ctx: (ctx, ctx)),                                          # d(a+b)/da = 1
    (UPat(Ops.MUL, name="ret"), lambda ctx, ret: (ret.src[1]*ctx, ret.src[0]*ctx)),   # d(ab)/da = b
    (UPat(Ops.RESHAPE, name="ret"), lambda ctx, ret: (ctx.reshape(ret.src[0].shape), None)),
    # ... more operations
])
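The pattern-matcher idea reduces to a lookup table from op type to a gradient rule. Here is a toy version in plain Python (hypothetical names, not tinygrad's API), backpropagating through z = (a * b) + a:

```python
# Each rule takes the upstream gradient ctx and the node's inputs,
# and returns one gradient per input.
grad_rules = {
    "ADD": lambda ctx, a, b: (ctx, ctx),          # d(a+b)/da = 1, d(a+b)/db = 1
    "MUL": lambda ctx, a, b: (ctx * b, ctx * a),  # d(a*b)/da = b, d(a*b)/db = a
}

a, b = 2.0, 5.0
ctx = 1.0                                         # dz/dz = 1
g_mul, g_a2 = grad_rules["ADD"](ctx, a * b, a)    # grads for (a*b) and a
g_a1, g_b = grad_rules["MUL"](g_mul, a, b)        # grads for a and b
da, db = g_a1 + g_a2, g_b                         # a feeds z twice: sum its paths
print(da, db)  # dz/da = b + 1 = 6.0, dz/db = a = 2.0
```

The accumulation step is why autograd walks the graph in topological order: a value used in several places receives a gradient contribution from each use.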

🔧 Technical Deep Dive

UOps are the micro-operations that make up tinygrad's internal representation. Each primitive operation like ADD, MUL, or RESHAPE becomes a UOp node with sources (inputs) and metadata; higher-level ops like conv2d are built by composing these primitives.
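A stripped-down sketch of such a node (field names are illustrative; tinygrad's real UOp carries more metadata, e.g. dtypes and shape tracking):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToyUOp:
    op: str                          # e.g. "ADD", "MUL", "CONST"
    srcs: tuple = ()                 # input nodes, forming a DAG
    arg: object = None               # op-specific metadata

# (2 * 3) + 4 as a DAG: built, but not yet executed
two, three, four = (ToyUOp("CONST", arg=v) for v in (2, 3, 4))
mul = ToyUOp("MUL", (two, three))
add = ToyUOp("ADD", (mul, four))

def evaluate(n):                     # a tiny interpreter walking the DAG
    if n.op == "CONST": return n.arg
    vals = [evaluate(s) for s in n.srcs]
    return vals[0] + vals[1] if n.op == "ADD" else vals[0] * vals[1]

print(evaluate(add))  # 10
```

In tinygrad the analogue of `evaluate` is the scheduler/codegen pipeline, which turns the DAG into compiled kernels instead of interpreting it.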

Quiz: Understanding the flow

Question 1: When does tinygrad actually execute computations?

Question 2: What is the purpose of building a computation graph?