Ztorch Architecture

This document describes the internal architecture of Ztorch.

Overview

Ztorch is a compiler-based ML library. Models are defined at compile time (or at runtime), lowered to an intermediate representation (IR), optimized, and then compiled to backend-specific code.

Model Definition → IR → Optimization → Autograd → Backend Codegen → Execution

Components

1. IR (Intermediate Representation)

The IR is Ztorch's internal graph representation. It's the single source of truth for all transformations.

pub const Graph = struct {
    nodes: []Node,
    edges: []Edge,
    allocator: Allocator,
};

pub const Node = union(enum) {
    matmul: MatMulOp,
    relu: ActivationOp,
    softmax: SoftmaxOp,
    layernorm: LayerNormOp,
    // ... more ops
};

pub const Edge = struct {
    from: NodeId,
    to: NodeId,
    tensor_shape: Shape,
};

Design principles:

  • Immutable after construction (transformations create new graphs)
  • Validates shape compatibility at creation
  • Lightweight - can be copied/cloned cheaply
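
As a sketch of the shape-validation principle, the helper below shows the kind of check a matmul edge needs at construction time; matmulOutputShape and the plain 2-D shapes are illustrative, not part of the real Graph API.

const std = @import("std");

// Hypothetical helper illustrating "validates shape compatibility at
// creation": a matmul edge is only accepted if the inner dimensions agree.
fn matmulOutputShape(a: [2]usize, b: [2]usize) error{ShapeMismatch}![2]usize {
    if (a[1] != b[0]) return error.ShapeMismatch;
    return [2]usize{ a[0], b[1] };
}

test "matmul shape validation" {
    const out = try matmulOutputShape(.{ 784, 128 }, .{ 128, 10 });
    try std.testing.expectEqual(@as(usize, 784), out[0]);
    try std.testing.expectEqual(@as(usize, 10), out[1]);
    try std.testing.expectError(error.ShapeMismatch, matmulOutputShape(.{ 784, 128 }, .{ 64, 10 }));
}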

2. Frontends

Frontends convert external model formats to Ztorch IR.

Native Zig API (v0.1):

const Model = ztorch.Sequential(.{
    ztorch.Linear(784, 128),
    ztorch.ReLU(),
});

This comptime struct is converted to IR during compilation.
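
As a rough sketch of the kind of comptime validation this enables (LayerSpec and checkSequential are hypothetical names, not the real ztorch.Sequential internals), mismatched layer dimensions can be rejected with a compile error:

const LayerSpec = struct { in: usize, out: usize };

fn checkSequential(comptime layers: []const LayerSpec) void {
    comptime var i: usize = 1;
    inline while (i < layers.len) : (i += 1) {
        if (layers[i - 1].out != layers[i].in) {
            @compileError("layer output/input dimensions do not match");
        }
    }
}

comptime {
    // Matches the example above: Linear(784, 128) feeding Linear(128, 10).
    checkSequential(&.{
        .{ .in = 784, .out = 128 },
        .{ .in = 128, .out = 10 },
    });
}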

ONNX Import (v0.2+):

const graph = try ztorch.frontends.onnx.load("model.onnx");

3. Optimization Passes

Optimization passes transform the IR to improve performance.

v0.1 Optimizations:

  • Operator fusion (e.g., MatMul + ReLU → FusedMatMulReLU)
  • Constant folding
  • Dead code elimination
  • Memory layout optimization

Example:

Before: MatMul → ReLU → Softmax (3 kernel launches)
After:  FusedMatMulReLUSoftmax (1 kernel launch)
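
A minimal sketch of the fusion idea, applied to a flat chain of ops rather than the real IR graph; the Op enum and fuseMatMulRelu are illustrative only:

const std = @import("std");

// Illustrative fusion pass: rewrite MatMul followed by ReLU into a single
// fused op. Operates on a flat op list instead of the real Graph.
const Op = enum { matmul, relu, softmax, fused_matmul_relu };

fn fuseMatMulRelu(allocator: std.mem.Allocator, ops: []const Op) ![]Op {
    // Worst case: nothing fuses and the output is as long as the input.
    const out = try allocator.alloc(Op, ops.len);
    var n: usize = 0;
    var i: usize = 0;
    while (i < ops.len) {
        if (i + 1 < ops.len and ops[i] == .matmul and ops[i + 1] == .relu) {
            out[n] = .fused_matmul_relu;
            i += 2; // consume both ops
        } else {
            out[n] = ops[i];
            i += 1;
        }
        n += 1;
    }
    return allocator.realloc(out, n);
}

test "fuse matmul + relu" {
    const fused = try fuseMatMulRelu(std.testing.allocator, &.{ .matmul, .relu, .softmax });
    defer std.testing.allocator.free(fused);
    try std.testing.expectEqualSlices(Op, &.{ .fused_matmul_relu, .softmax }, fused);
}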

4. Autograd

The autograd system generates backward pass operations from forward pass IR.

Each operation has a gradient function:

pub const MatMulOp = struct {
    pub fn forward(a: Tensor, b: Tensor) Tensor { ... }

    pub fn backward(
        d_output: Tensor,
        a: Tensor,
        b: Tensor,
    ) struct { d_a: Tensor, d_b: Tensor } {
        // d_a = d_output @ b.T
        // d_b = a.T @ d_output
        ...
    }
};

The autograd pass walks the forward graph and generates the backward graph.
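
A sketch of that ordering, assuming a flat, topologically sorted op list instead of the real graph; FwdOp, BwdOp, and gradientOf stand in for the per-op backward functions shown above:

const std = @import("std");

// Hypothetical sketch of the autograd pass over a topologically ordered
// forward op list: emit one backward op per forward op, in reverse order.
// Real graphs carry edges and saved activations; this shows only ordering.
const FwdOp = enum { matmul, relu, softmax };
const BwdOp = enum { matmul_grad, relu_grad, softmax_grad };

fn gradientOf(op: FwdOp) BwdOp {
    return switch (op) {
        .matmul => .matmul_grad,
        .relu => .relu_grad,
        .softmax => .softmax_grad,
    };
}

fn buildBackward(allocator: std.mem.Allocator, forward: []const FwdOp) ![]BwdOp {
    const backward = try allocator.alloc(BwdOp, forward.len);
    for (forward, 0..) |op, i| {
        backward[forward.len - 1 - i] = gradientOf(op);
    }
    return backward;
}

test "backward ops are emitted in reverse order" {
    const bwd = try buildBackward(std.testing.allocator, &.{ .matmul, .relu, .softmax });
    defer std.testing.allocator.free(bwd);
    try std.testing.expectEqualSlices(BwdOp, &.{ .softmax_grad, .relu_grad, .matmul_grad }, bwd);
}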

5. Backend Codegen

Backend codegen converts IR operations to executable code.

CPU Scalar (reference):

  • Direct Zig implementation
  • Simple, obviously correct
  • Used for verification

CPU SIMD:

  • Intrinsics for AVX2/AVX512 (x86)
  • Intrinsics for NEON (ARM)
  • Falls back to scalar if unsupported
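
A rough sketch of the SIMD-plus-scalar-fallback pattern using Zig's portable @Vector type; the fixed lane count is an assumption, and the real backend would use architecture-specific intrinsics and pick the width per target:

const std = @import("std");

// In-place ReLU: vectorized main loop with a scalar tail. The lane count is
// a fixed assumption (8 x f32, roughly AVX2); the scalar loop handles the
// remainder and is the full path on targets without SIMD support.
fn reluInPlace(data: []f32) void {
    const lanes = 8;
    const V = @Vector(lanes, f32);
    const zero: V = @splat(0.0);
    var i: usize = 0;
    while (i + lanes <= data.len) : (i += lanes) {
        const chunk: *[lanes]f32 = data[i..][0..lanes];
        const v: V = chunk.*;
        chunk.* = @max(v, zero);
    }
    while (i < data.len) : (i += 1) {
        data[i] = @max(data[i], 0.0);
    }
}

test "relu clamps negatives" {
    var data = [_]f32{ -1.0, 2.0, -3.0, 4.0, -5.0, 6.0, -7.0, 8.0, -9.0 };
    reluInPlace(&data);
    try std.testing.expectEqual(@as(f32, 0.0), data[0]);
    try std.testing.expectEqual(@as(f32, 2.0), data[1]);
    try std.testing.expectEqual(@as(f32, 0.0), data[8]);
}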

CUDA:

  • Generates PTX assembly
  • Comptime specialization for shapes
  • Tensor core utilization

ROCm:

  • Generates LLVM IR
  • Similar to CUDA approach

Vulkan:

  • Generates SPIR-V
  • Portable across vendors

6. Runtime

The runtime manages:

  • Memory allocation (device buffers)
  • Kernel launching
  • Synchronization
  • Error handling

Memory management:

  • Static allocation during model compilation
  • No dynamic allocation during forward/backward
  • Explicit buffer reuse
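
One way to read these rules, as a sketch: carve every activation buffer out of a single block at plan time, so the forward/backward hot path never touches a general-purpose allocator. The planActivations name and buffer sizes below are hypothetical.

const std = @import("std");

// Sketch: all activation buffers come from one fixed block sized up front;
// nothing is allocated during forward/backward.
// Usage: var block: [4096]u8 = undefined; const bufs = try planActivations(&block);
fn planActivations(backing: []u8) ![2][]f32 {
    var fba = std.heap.FixedBufferAllocator.init(backing);
    const a = fba.allocator();
    const hidden = try a.alloc(f32, 128); // hypothetical layer sizes
    const logits = try a.alloc(f32, 10);
    return [2][]f32{ hidden, logits };
}

Buffer reuse then becomes an explicit decision made when the plan is built, rather than something the allocator does implicitly.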

Compilation Flow

Comptime Model Definition

const Model = ztorch.Sequential(.{
    ztorch.Linear(784, 128),
    ztorch.ReLU(),
    ztorch.Linear(128, 10),
});

// At comptime:
// 1. Type-check layer compatibility (128 matches between layers)
// 2. Build IR graph
// 3. Apply optimization passes
// 4. Generate backward pass

Compilation

var model = try Model.compile(.cuda, allocator);

// During compile():
// 1. Finalize IR (if not comptime)
// 2. Allocate device memory
// 3. Generate backend code (PTX)
// 4. Load kernels
// 5. Create execution plan

Execution

const output = try model.forward(input);

// During forward():
// 1. Copy input to device (if needed)
// 2. Launch fused kernels in sequence
// 3. Return output tensor

Data Structures

Tensor

pub const Tensor = struct {
    data: DevicePtr,
    shape: Shape,
    stride: Stride,
    dtype: DType,
    device: Device,
    requires_grad: bool,

    pub fn item(self: Tensor) f32 { ... }
    pub fn reshape(self: Tensor, new_shape: Shape) Tensor { ... }
    // ...
};

Shape

pub const Shape = struct {
    dims: [MAX_DIMS]usize,
    ndim: u8,

    pub fn numel(self: Shape) usize {
        var n: usize = 1;
        for (self.dims[0..self.ndim]) |d| n *= d;
        return n;
    }
};
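
As a companion sketch for the stride field on Tensor, contiguous (row-major) strides can be derived from a Shape like this; treating strides as element counts rather than bytes is an assumption:

// Sketch: row-major strides for a contiguous tensor, in elements.
// Uses the Shape type and MAX_DIMS constant from above.
fn contiguousStrides(shape: Shape) [MAX_DIMS]usize {
    var strides = [_]usize{0} ** MAX_DIMS;
    var acc: usize = 1;
    var i: usize = shape.ndim;
    while (i > 0) {
        i -= 1;
        strides[i] = acc;
        acc *= shape.dims[i];
    }
    return strides;
}

For a shape of (2, 3, 4) this yields strides of (12, 4, 1).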

Backend Interface

All backends implement the same interface:

pub const Backend = struct {
    vtable: *const VTable,
    context: *anyopaque,

    pub const VTable = struct {
        matmul: *const fn (*anyopaque, Tensor, Tensor) Tensor,
        relu: *const fn (*anyopaque, Tensor) Tensor,
        softmax: *const fn (*anyopaque, Tensor, usize) Tensor,
        // ... all ops
    };
};

This allows backends to be selected at runtime and lets backend parity be tested against the CPU reference.
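
Dispatch goes through the vtable; a thin wrapper for the relu entry (using the Backend and Tensor types defined above) would look like the sketch below. A parity test can then run the same op through two Backend values, say CPU scalar and CUDA, on identical inputs and compare the outputs element-wise.

// Sketch: calling an op through the type-erased vtable.
fn backendRelu(backend: Backend, x: Tensor) Tensor {
    return backend.vtable.relu(backend.context, x);
}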

Performance Model

Napkin Math

Before implementing any operation, estimate its cost:

MatMul (M, K) @ (K, N):
- FLOPs: 2 * M * K * N
- Memory: (M*K + K*N + M*N) * sizeof(f32) bytes
- Arithmetic intensity: 2*M*K*N / ((M*K + K*N + M*N) * sizeof(f32)) FLOPs/byte

Example: (1024, 1024) @ (1024, 1024)
- FLOPs: 2.15 GFLOP
- Memory: ~12.6 MB
- Arithmetic intensity: ~170 FLOPs/byte
- On an RTX 4090 (82 TFLOPS FP32, ~1 TB/s):
  - Ridge point: 82 FLOPs/byte; 170 > 82, so compute bound ✓
  - Compute time: 2.15 GFLOP / 82 TFLOPS ≈ 26µs
  - Memory time: 12.6 MB / 1 TB/s ≈ 13µs
  - Actual should be ~26µs
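
The same napkin math as a quick sketch in code, using the peak figures quoted above:

const std = @import("std");

// Napkin-math helpers for a single-precision MatMul of (M, K) @ (K, N).
fn matmulFlops(m: usize, k: usize, n: usize) usize {
    return 2 * m * k * n;
}

fn matmulBytes(m: usize, k: usize, n: usize) usize {
    return (m * k + k * n + m * n) * @sizeOf(f32);
}

test "1024x1024 matmul is compute bound on an RTX 4090" {
    const flops = matmulFlops(1024, 1024, 1024); // ~2.15 GFLOP
    const bytes = matmulBytes(1024, 1024, 1024); // ~12.6 MB
    const intensity = @as(f64, @floatFromInt(flops)) / @as(f64, @floatFromInt(bytes));
    // Ridge point: 82 TFLOPS / 1 TB/s = 82 FLOPs/byte.
    try std.testing.expect(intensity > 82.0);
}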

Benchmarking

Every implementation is benchmarked:

=== MatMul 1024x1024 ===
CPU Scalar:       450ms (4.8 GFLOPS)
CPU AVX2:         112ms (19.2 GFLOPS) - 4.0x speedup
CUDA (RTX 4090):  0.5ms (4300 GFLOPS) - 900x speedup

Testing Strategy

See testing.md for full details.

Levels:

  1. Unit tests (each op)
  2. Backend parity (GPU matches CPU)
  3. Gradient checks (numerical vs autograd)
  4. Integration (full model training)
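
As a sketch of level 3, a gradient check compares an analytical derivative against a central finite difference; the square function below stands in for a real op and its autograd gradient:

const std = @import("std");

// Forward function and its hand-written (i.e. "autograd") derivative.
fn square(x: f64) f64 {
    return x * x;
}

fn squareGrad(x: f64) f64 {
    return 2.0 * x;
}

// Central finite difference: (f(x + eps) - f(x - eps)) / (2 * eps).
fn numericalGrad(f: *const fn (f64) f64, x: f64, eps: f64) f64 {
    return (f(x + eps) - f(x - eps)) / (2.0 * eps);
}

test "gradient check" {
    const x: f64 = 1.5;
    const analytic = squareGrad(x);
    const numeric = numericalGrad(square, x, 1e-5);
    try std.testing.expectApproxEqAbs(analytic, numeric, 1e-6);
}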

Future Architecture

Dynamic Shapes (v0.2)

Support runtime shape variation within bounds:

const Model = ztorch.Sequential(.{
    ztorch.Linear(784, 128),
    // ... batch size determined at runtime
});

Distributed (zbmd integration)

Ztorch provides the compute engine, zbmd provides fault-tolerant distribution.

Quantization (v0.3)

Support int8, fp16, bfloat16 for inference acceleration.