Module 1: What tinygrad does — and what happens when you use it

You're training a neural network on MNIST

Imagine you're building an app that recognizes handwritten digits. You want to train a model using tinygrad, a minimalist deep learning framework. You start with code that looks like this:

from tinygrad import Tensor, nn

class Model:
    def __init__(self):
        self.layers = [
            nn.Conv2d(1, 32, 5), Tensor.relu,
            nn.Conv2d(32, 32, 5), Tensor.relu,
            nn.BatchNorm(32), Tensor.max_pool2d,
            nn.Conv2d(32, 64, 3), Tensor.relu,
            nn.Conv2d(64, 64, 3), Tensor.relu,
            nn.BatchNorm(64), Tensor.max_pool2d,
            lambda x: x.flatten(1), nn.Linear(576, 10)
        ]

    def __call__(self, x: Tensor) -> Tensor:
        return x.sequential(self.layers)

model = Model()
opt = nn.optim.Adam(nn.state.get_parameters(model), lr=0.001)

with Tensor.train():
    for step in range(10):
        # images, labels: a batch of 28x28 MNIST digits and their classes
        # Forward pass
        logits = model(images)
        loss = logits.sparse_categorical_crossentropy(labels)

        # Backward pass
        opt.zero_grad()
        loss.backward()
        opt.step()

Plain English translation:

  • You define a model as a list of layers, each transforming a Tensor
  • You create an optimizer to adjust the model's parameters
  • For each training step: you pass data through the model (forward), compute the loss, then update the parameters (backward + step)

Quick Check: Which operations compute gradients?

What happens under the hood?

When you run this code, tinygrad doesn't immediately execute operations. Instead, it builds a computation graph in memory, then optimizes and executes it later. Here's the journey:

Python Code → Build Graph (UOps) → Optimize & Fuse → Compile Kernels → Execute on GPU/CPU

# This line doesn't compute yet!
logits = model(images)

Behind the scenes:

  1. Each Tensor operation (conv2d, relu, etc.) creates one or more UOp (micro-operation) nodes
  2. These UOps are linked together forming a directed acyclic graph (DAG)
  3. No actual computation happens until you call .backward() or .realize()

This is called lazy evaluation — tinygrad waits until it sees the whole computation before running anything. This lets it optimize globally.
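The idea can be sketched in a few lines of plain Python with a toy graph node (an illustration only, not tinygrad's actual UOp machinery):

```python
# Toy sketch of lazy evaluation: operations record a node in a graph
# instead of computing immediately.
class LazyOp:
    def __init__(self, fn, *srcs):
        self.fn, self.srcs = fn, srcs   # remember the op and its inputs
        self.result = None              # nothing computed yet

    def realize(self):
        # Compute only when asked, recursively realizing inputs first
        if self.result is None:
            self.result = self.fn(*[s.realize() for s in self.srcs])
        return self.result

class Const(LazyOp):
    def __init__(self, v): self.result = v
    def realize(self): return self.result

# Building the graph runs no arithmetic...
graph = LazyOp(lambda a, b: a + b,
               Const(2),
               LazyOp(lambda a, b: a * b, Const(3), Const(4)))
# ...until realize() walks the DAG
print(graph.realize())  # 2 + 3*4 = 14
```

Because the whole graph exists before `realize()` runs, an optimizer could inspect and rewrite it first, which is exactly the opening tinygrad exploits.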

💡 Aha! Lazy Evaluation is Key

If tinygrad executed each operation immediately, it couldn't fuse operations or reorder them. By building the full graph first, tinygrad can see the entire computation and optimize it as a whole.
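A minimal illustration of why fusion needs the whole graph (conceptual sketch, not tinygrad's scheduler):

```python
# Two elementwise ops executed eagerly need two passes over the data
# and an intermediate buffer; seeing both ops at once allows one pass.
def unfused(xs):
    tmp = [x + 1 for x in xs]         # pass 1: writes an intermediate buffer
    return [t * 2 for t in tmp]       # pass 2: reads it back

def fused(xs):
    return [(x + 1) * 2 for x in xs]  # one pass, no intermediate buffer

data = [0, 1, 2, 3]
assert unfused(data) == fused(data) == [2, 4, 6, 8]
```

On a GPU the saved intermediate buffer is a round trip to memory, which is usually the dominant cost of elementwise ops.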

Realizing the computation

Calling loss.backward() builds the gradient graph; the kernels actually run when the results are needed, e.g. when opt.step() realizes the updated parameters:

# Simplified sketch of what loss.backward() does:
def backward(self):
    # 1. Build the gradient graph by applying the chain rule to each UOp
    # 2. Attach a lazy .grad tensor to each parameter
    # 3. Execution is still deferred until the grads are realized
    ...
Backward pass → Chain Rule → Gradients → Kernels
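Before looking at tinygrad's implementation, here is the chain rule itself checked numerically in ordinary Python:

```python
# Chain rule worked by hand for f(x) = (w*x + b)**2:
# with u = w*x + b, df/dx = df/du * du/dx = 2*u * w.
w, b, x = 3.0, 1.0, 2.0
u = w * x + b
analytic = 2 * u * w          # chain rule result

# Finite-difference check of the same derivative
eps = 1e-6
f = lambda x: (w * x + b) ** 2
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
assert abs(analytic - numeric) < 1e-3
print(analytic)  # 42.0
```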

The backward pass uses the chain rule to compute gradients. tinygrad's autograd system has a pm_gradient pattern matcher that knows how to differentiate each operation type:

# From tinygrad/gradient.py (abridged)
pm_gradient = PatternMatcher([
    (UPat(Ops.ADD), lambda ctx: (ctx, ctx)),                                          # d(a+b)/da = 1
    (UPat(Ops.MUL, name="ret"), lambda ctx, ret: (ret.src[1]*ctx, ret.src[0]*ctx)),   # d(ab)/da = b
    (UPat(Ops.RESHAPE, name="ret"), lambda ctx, ret: (ctx.reshape(ret.src[0].shape), None)),
    # ... more operations
])
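The pattern-matcher idea reduces to a lookup table from op type to a gradient rule. Here is a toy version in plain Python (hypothetical names, not tinygrad's API), backpropagating through z = (a * b) + a:

```python
# Each rule takes the upstream gradient ctx and the node's inputs,
# and returns one gradient per input.
grad_rules = {
    "ADD": lambda ctx, a, b: (ctx, ctx),          # d(a+b)/da = 1, d(a+b)/db = 1
    "MUL": lambda ctx, a, b: (ctx * b, ctx * a),  # d(a*b)/da = b, d(a*b)/db = a
}

a, b = 2.0, 5.0
ctx = 1.0                                         # dz/dz = 1
g_mul, g_a2 = grad_rules["ADD"](ctx, a * b, a)    # grads for (a*b) and a
g_a1, g_b = grad_rules["MUL"](g_mul, a, b)        # grads for a and b
da, db = g_a1 + g_a2, g_b                         # a feeds z twice: sum its paths
print(da, db)  # dz/da = b + 1 = 6.0, dz/db = a = 2.0
```

The accumulation step is why autograd walks the graph in topological order: a value used in several places receives a gradient contribution from each use.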

🔧 Technical Deep Dive

UOps are the micro-operations that make up tinygrad's internal representation. Each primitive operation like ADD, MUL, or RESHAPE becomes a UOp node with sources (inputs) and metadata; higher-level ops like conv2d are built by composing these primitives.
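A stripped-down sketch of such a node (field names are illustrative; tinygrad's real UOp carries more metadata, e.g. dtypes and shape tracking):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToyUOp:
    op: str                          # e.g. "ADD", "MUL", "CONST"
    srcs: tuple = ()                 # input nodes, forming a DAG
    arg: object = None               # op-specific metadata

# (2 * 3) + 4 as a DAG: built, but not yet executed
two, three, four = (ToyUOp("CONST", arg=v) for v in (2, 3, 4))
mul = ToyUOp("MUL", (two, three))
add = ToyUOp("ADD", (mul, four))

def evaluate(n):                     # a tiny interpreter walking the DAG
    if n.op == "CONST": return n.arg
    vals = [evaluate(s) for s in n.srcs]
    return vals[0] + vals[1] if n.op == "ADD" else vals[0] * vals[1]

print(evaluate(add))  # 10
```

In tinygrad the analogue of `evaluate` is the scheduler/codegen pipeline, which turns the DAG into compiled kernels instead of interpreting it.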

Quiz: Understanding the flow

Question 1: When does tinygrad actually execute computations?

Question 2: What is the purpose of building a computation graph?