Ztorch
Pure Zig machine learning library with comptime optimization and multiple backends
Ztorch is a high-performance ML library built from the ground up in Zig, featuring:
- Comptime graph optimization - Fuse operations and generate specialized kernels at compile time
- Multiple backends - CPU (scalar & SIMD), CUDA, ROCm, Vulkan
- Tiger Style development - Safety first, benchmarked performance, zero technical debt
- Test-driven - TDD from day 0, tested on Linux/macOS/Windows, x86_64/aarch64
- Clean API - Define models as Zig structs, explicit and ergonomic
Status
v0.1-dev - Early development. Core architecture and CPU backend in progress.
- ✅ Project architecture defined
- ✅ Tiger Style development process established
- 🚧 CPU scalar backend (in progress)
- ⏳ CUDA backend (planned)
- ⏳ ROCm backend (planned)
- ⏳ Vulkan backend (planned)
Quick Start
const std = @import("std");
const ztorch = @import("ztorch");
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer _ = gpa.deinit();
// Define model at comptime
const Model = ztorch.Sequential(.{
ztorch.Linear(784, 128),
ztorch.ReLU(),
ztorch.Linear(128, 10),
ztorch.Softmax(),
});
// Compile for backend
var model = try Model.compile(.cpu, gpa.allocator());
defer model.deinit();
// Train
const input = try ztorch.Tensor.randn(.{32, 784}, .f32, .cpu);
const labels = try ztorch.Tensor.randint(.{32}, 10, .cpu);
const output = try model.forward(input);
const loss = try ztorch.crossEntropy(output, labels);
try model.backward(loss);
try model.step(.{ .adam = .{ .lr = 0.001 } });
std.debug.print("Loss: {d:.4}\n", .{loss.item()});
}
Installation
Add to your build.zig.zon:
.dependencies = .{
.ztorch = .{
.url = "https://github.com/mattneel/ztorch/archive/refs/tags/v0.1.0.tar.gz",
.hash = "...",
},
},
In your build.zig:
const ztorch = b.dependency("ztorch", .{
.target = target,
.optimize = optimize,
});
exe.root_module.addImport("ztorch", ztorch.module("ztorch"));
Why Ztorch?
Comptime optimization: Your model structure is known at compile time, so Ztorch can fuse operations, eliminate overhead, and generate optimal kernels for your exact architecture.
Multiple backends: Write once, run on CPU, CUDA, ROCm, or Vulkan. Each backend is optimized for its target.
Proven correct: Every operation is tested against a reference implementation. GPU backends are verified against CPU. No surprises.
Benchmarked: Every optimization is measured. We prove the gains with napkin math and benchmarks.
No magic: Explicit control flow, static allocation, clear error handling. You know exactly what the machine is doing.
Building
# Run tests
zig build test
# Run benchmarks
zig build bench
# Build examples
zig build examples
# Check formatting
zig fmt --check .
Design Principles (Tiger Style)
- Safety - Fixed limits, static allocation, explicit errors, fail fast
- Performance - Napkin math first, benchmark everything, prove all gains
- Developer Experience - Clean API, clear names, good documentation
- Zero Technical Debt - TDD from day 0, do it right the first time
Ztorch Architecture
This document describes the internal architecture of Ztorch.
Overview
Ztorch is a compiler-based ML library. Models are defined at compile time (or runtime), converted to an internal representation (IR), optimized, and then compiled to backend-specific code.
Model Definition → IR → Optimization → Autograd → Backend Codegen → Execution
Components
1. IR (Internal Representation)
The IR is Ztorch's internal graph representation. It's the single source of truth for all transformations.
pub const Graph = struct {
nodes: []Node,
edges: []Edge,
allocator: Allocator,
};
pub const Node = union(enum) {
matmul: MatMulOp,
relu: ActivationOp,
softmax: SoftmaxOp,
layernorm: LayerNormOp,
// ... more ops
};
pub const Edge = struct {
from: NodeId,
to: NodeId,
tensor_shape: Shape,
};
Design principles:
- Immutable after construction (transformations create new graphs)
- Validates shape compatibility at creation
- Lightweight - can be copied/cloned cheaply
2. Frontends
Frontends convert external model formats to Ztorch IR.
Native Zig API (v0.1):
const Model = ztorch.Sequential(.{
ztorch.Linear(784, 128),
ztorch.ReLU(),
});
This comptime struct is converted to IR during compilation.
ONNX Import (v0.2+):
const graph = try ztorch.frontends.onnx.load("model.onnx");
3. Optimization Passes
Optimization passes transform the IR to improve performance.
v0.1 Optimizations:
- Operator fusion (e.g., MatMul + ReLU → FusedMatMulReLU)
- Constant folding
- Dead code elimination
- Memory layout optimization
Example:
Before: MatMul → ReLU → Softmax (3 kernel launches)
After: FusedMatMulReLUSoftmax (1 kernel launch)
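As a rough sketch of what such a pass looks like, the walk below rewrites adjacent MatMul/ReLU pairs in a linear op chain. The OpKind tag and the chain representation are simplified stand-ins for the real Graph/Node types, not the Ztorch implementation.
const std = @import("std");

// Simplified stand-ins for the IR node kinds described above.
const OpKind = enum { matmul, relu, softmax, fused_matmul_relu };

/// Fuse every MatMul immediately followed by a ReLU into one fused op.
/// Works on a linear chain for brevity; a real pass would walk edges
/// and build a fresh graph to keep the IR immutable.
fn fuseMatMulRelu(allocator: std.mem.Allocator, ops: []const OpKind) ![]OpKind {
    var out = try allocator.alloc(OpKind, ops.len); // output is never longer than input
    var n: usize = 0;
    var i: usize = 0;
    while (i < ops.len) : (i += 1) {
        if (ops[i] == .matmul and i + 1 < ops.len and ops[i + 1] == .relu) {
            out[n] = .fused_matmul_relu;
            i += 1; // absorb the ReLU we just fused
        } else {
            out[n] = ops[i];
        }
        n += 1;
    }
    return allocator.realloc(out, n);
}

test "fuse matmul+relu chain" {
    const chain = [_]OpKind{ .matmul, .relu, .softmax };
    const fused = try fuseMatMulRelu(std.testing.allocator, &chain);
    defer std.testing.allocator.free(fused);
    try std.testing.expectEqualSlices(OpKind, &[_]OpKind{ .fused_matmul_relu, .softmax }, fused);
}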
4. Autograd
The autograd system generates backward pass operations from forward pass IR.
Each operation has a gradient function:
pub const MatMulOp = struct {
pub fn forward(a: Tensor, b: Tensor) Tensor { ... }
pub fn backward(
d_output: Tensor,
a: Tensor,
b: Tensor,
) struct { d_a: Tensor, d_b: Tensor } {
// d_a = d_output @ b.T
// d_b = a.T @ d_output
...
}
};
The autograd pass walks the forward graph and generates the backward graph.
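A minimal sketch of that reverse walk over a simplified op list. The FwdOp/BwdOp tags and the gradOf table are illustrative stand-ins; the real pass emits full Node structs and wires up the edges between them.
const std = @import("std");

// Simplified forward-op tags and their matching gradient ops (hypothetical names).
const FwdOp = enum { matmul, relu, softmax };
const BwdOp = enum { matmul_backward, relu_backward, softmax_backward };

fn gradOf(op: FwdOp) BwdOp {
    return switch (op) {
        .matmul => .matmul_backward,
        .relu => .relu_backward,
        .softmax => .softmax_backward,
    };
}

/// Emit backward ops by visiting the forward chain in reverse order.
fn buildBackward(allocator: std.mem.Allocator, forward: []const FwdOp) ![]BwdOp {
    const backward = try allocator.alloc(BwdOp, forward.len);
    for (backward, 0..) |*b, i| {
        b.* = gradOf(forward[forward.len - 1 - i]);
    }
    return backward;
}

test "backward ops come out reversed" {
    const fwd = [_]FwdOp{ .matmul, .relu, .softmax };
    const bwd = try buildBackward(std.testing.allocator, &fwd);
    defer std.testing.allocator.free(bwd);
    try std.testing.expectEqualSlices(
        BwdOp,
        &[_]BwdOp{ .softmax_backward, .relu_backward, .matmul_backward },
        bwd,
    );
}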
5. Backend Codegen
Backend codegen converts IR operations to executable code.
CPU Scalar (reference):
- Direct Zig implementation
- Simple, obviously correct
- Used for verification
CPU SIMD:
- Intrinsics for AVX2/AVX512 (x86)
- Intrinsics for NEON (ARM)
- Falls back to scalar if unsupported
CUDA:
- Generates PTX assembly
- Comptime specialization for shapes
- Tensor core utilization
ROCm:
- Generates LLVM IR
- Similar to CUDA approach
Vulkan:
- Generates SPIR-V
- Portable across vendors
6. Runtime
The runtime manages:
- Memory allocation (device buffers)
- Kernel launching
- Synchronization
- Error handling
Memory management:
- Static allocation during model compilation
- No dynamic allocation during forward/backward
- Explicit buffer reuse
Compilation Flow
Comptime Model Definition
const Model = ztorch.Sequential(.{
ztorch.Linear(784, 128),
ztorch.ReLU(),
ztorch.Linear(128, 10),
});
// At comptime:
// 1. Type-check layer compatibility (128 matches between layers)
// 2. Build IR graph
// 3. Apply optimization passes
// 4. Generate backward pass
Compilation
var model = try Model.compile(.cuda, allocator);
// During compile():
// 1. Finalize IR (if not comptime)
// 2. Allocate device memory
// 3. Generate backend code (PTX)
// 4. Load kernels
// 5. Create execution plan
Execution
const output = try model.forward(input);
// During forward():
// 1. Copy input to device (if needed)
// 2. Launch fused kernels in sequence
// 3. Return output tensor
Data Structures
Tensor
pub const Tensor = struct {
data: DevicePtr,
shape: Shape,
stride: Stride,
dtype: DType,
device: Device,
requires_grad: bool,
pub fn item(self: Tensor) f32 { ... }
pub fn reshape(self: Tensor, new_shape: Shape) Tensor { ... }
// ...
};
Shape
pub const Shape = struct {
dims: [MAX_DIMS]usize,
ndim: u8,
pub fn numel(self: Shape) usize {
var n: usize = 1;
for (self.dims[0..self.ndim]) |d| n *= d;
return n;
}
};
Backend Interface
All backends implement the same interface:
pub const Backend = struct {
vtable: *const VTable,
context: *anyopaque,
pub const VTable = struct {
matmul: *const fn (*anyopaque, Tensor, Tensor) Tensor,
relu: *const fn (*anyopaque, Tensor) Tensor,
softmax: *const fn (*anyopaque, Tensor, usize) Tensor,
// ... all ops
};
};
This allows runtime backend selection and testing backend parity.
Performance Model
Napkin Math
Before implementing any operation, estimate its cost:
MatMul (M, K) @ (K, N):
- FLOPs: 2 * M * K * N
- Memory: (M*K + K*N + M*N) * sizeof(f32) bytes
- Arithmetic intensity: 2*M*K*N / (M*K + K*N + M*N)
Example: (1024, 1024) @ (1024, 1024)
- FLOPs: 2.15B
- Memory: 12 MB
- On RTX 4090 (82 TFLOPS, 1 TB/s):
- Arithmetic intensity: ~175 FLOPs/byte > 82 FLOPs/byte, so compute bound ✓
- Compute time: 2.15 GFLOPs / 82 TFLOPS ≈ 26 µs
- Memory time: 12 MB / 1 TB/s = 12 µs, so expect ~26 µs plus launch overhead
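The same napkin math can be captured in a tiny helper. This is just the formulas above written in Zig; it is not part of the Ztorch API.
const std = @import("std");

/// FLOPs, bytes moved, and arithmetic intensity for C(M,N) = A(M,K) @ B(K,N), f32.
fn matmulNapkin(m: u64, k: u64, n: u64) struct { flops: u64, bytes: u64, intensity: f64 } {
    const flops = 2 * m * k * n;
    const bytes = (m * k + k * n + m * n) * @sizeOf(f32);
    return .{
        .flops = flops,
        .bytes = bytes,
        .intensity = @as(f64, @floatFromInt(flops)) / @as(f64, @floatFromInt(bytes)),
    };
}

test "1024^3 matmul is compute bound on an 82 FLOPs/byte machine" {
    const est = matmulNapkin(1024, 1024, 1024);
    try std.testing.expectEqual(@as(u64, 2_147_483_648), est.flops);
    try std.testing.expect(est.intensity > 82.0); // ~171 FLOPs/byte
}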
Benchmarking
Every implementation is benchmarked:
=== MatMul 1024x1024 ===
CPU Scalar: 450ms (4.8 GFLOPS)
CPU AVX2: 112ms (19.2 GFLOPS) - 4.0x speedup
CUDA (RTX 4090): 0.5ms (4300 GFLOPS) - 900x speedup
Testing Strategy
See testing.md for full details.
Levels:
- Unit tests (each op)
- Backend parity (GPU matches CPU)
- Gradient checks (numerical vs autograd)
- Integration (full model training)
Future Architecture
Dynamic Shapes (v0.2)
Support runtime shape variation within bounds:
const Model = ztorch.Sequential(.{
ztorch.Linear(784, 128),
// ... batch size determined at runtime
});
Distributed (zbmd integration)
Ztorch provides the compute engine, zbmd provides fault-tolerant distribution.
Quantization (v0.3)
Support int8, fp16, bfloat16 for inference acceleration.
Ztorch API Reference
Complete API documentation for Ztorch v0.1.
Core Types
Tensor
The fundamental data structure.
pub const Tensor = struct {
data: DevicePtr,
shape: Shape,
stride: Stride,
dtype: DType,
device: Device,
requires_grad: bool,
};
Creation
// Zeros
const t = try Tensor.zeros(.{32, 128}, .f32, .cpu);
// Ones
const t = try Tensor.ones(.{32, 128}, .f32, .cpu);
// Random normal distribution
const t = try Tensor.randn(.{32, 128}, .f32, .cpu);
// Random integers [0, high)
const t = try Tensor.randint(.{32}, 10, .cpu);
// From slice
const data = [_]f32{ 1, 2, 3, 4 };
const t = try Tensor.fromSlice(.{2, 2}, &data, .cpu);
Operations
// Reshape (view, no copy)
const reshaped = try t.reshape(.{64, 64});
// Transpose
const transposed = try t.transpose();
// Get single item (must be 1-element tensor)
const value: f32 = t.item();
// To slice (copies to CPU)
var buffer: [4]f32 = undefined;
try t.toSlice(&buffer);
Shape
pub const Shape = struct {
dims: [MAX_DIMS]usize,
ndim: u8,
};
// Create shape
const shape = Shape.init(&[_]usize{ 32, 128, 256 });
// Number of elements
const n = shape.numel(); // 32 * 128 * 256
Device
pub const Device = enum {
cpu,
cuda,
rocm,
vulkan,
};
DType
pub const DType = enum {
f32,
f64,
i32,
i64,
// more types in future versions
};
Model Definition
Sequential
Build a sequential model from layers.
const Model = ztorch.Sequential(.{
ztorch.Linear(784, 128),
ztorch.ReLU(),
ztorch.Linear(128, 64),
ztorch.ReLU(),
ztorch.Linear(64, 10),
});
Layers
Linear
Fully connected layer: y = x @ W.T + b
pub fn Linear(comptime in_features: usize, comptime out_features: usize) type
Example:
ztorch.Linear(784, 128) // 784 → 128
Activations
ReLU: y = max(0, x)
ztorch.ReLU()
GELU: Gaussian Error Linear Unit
ztorch.GELU()
Softmax: y[i] = exp(x[i]) / sum(exp(x))
ztorch.Softmax(.{.dim = -1}) // softmax over last dimension
Normalization
LayerNorm:
ztorch.LayerNorm(normalized_shape, .{
.eps = 1e-5,
.elementwise_affine = true,
})
Utilities
Data Loading
Helpers for turning binary dumps into tensors. Especially useful with the MNIST preparation script.
const cwd = std.fs.cwd();
var images = try ztorch.data.mnist.loadImages(cwd, "data/mnist_train_x.bin", allocator, .{
.max_samples = 1024, // 0 = entire file
});
defer images.deinit();
var labels = try ztorch.data.mnist.loadLabels(cwd, "data/mnist_train_y.bin", images.shape.dims[0], allocator, .{});
defer labels.deinit();
var iter = try ztorch.data.BatchIterator.init(&images, &labels, allocator, .{
.batch_size = 128,
.shuffle = true,
.seed = 42,
});
defer iter.deinit();
while (try iter.next()) |batch| {
var owned = batch;
defer owned.deinit();
// use owned.inputs / owned.labels.? tensors
}
Parameter Initializers
Create trainable tensors with the right gradient flags.
var weights = try ztorch.init.uniformParam(&[_]usize{ 784, 128 }, 0.05, allocator, 1);
var bias = try ztorch.init.zerosParam(&[_]usize{ 1, 128 }, allocator);
Metrics
Compute classification accuracy from logits or probabilities.
const train_acc = try ztorch.metrics.accuracyFromLogits(&logits, &labels, allocator);
var probs = try ztorch.ops.activations.softmax_cpu_scalar(&logits, -1, allocator);
defer probs.deinit();
const val_acc = try ztorch.metrics.accuracyFromProbabilities(&probs, &labels);
Checkpointing
Persist model parameters to disk and restore them later.
const entries = [_]ztorch.checkpoint.NamedTensor{
.{ .name = "w1", .tensor = &w1 },
.{ .name = "b1", .tensor = &b1 },
};
try ztorch.checkpoint.save("model.ckpt", entries[0..]);
var checkpoint = try ztorch.checkpoint.load("model.ckpt", allocator);
defer checkpoint.deinit();
const weights = checkpoint.get("w1") orelse unreachable;
MNIST Helpers
Utility functions for the built-in two-layer MNIST classifier.
const accuracy = try ztorch.models.mnist.evaluate(
&test_images,
&test_labels,
&w1,
&b1,
&w2,
&b2,
allocator,
);
Compilation
pub fn compile(
comptime self: anytype,
comptime backend: Device,
allocator: Allocator,
) !CompiledModel
Example:
const Model = ztorch.Sequential(.{
ztorch.Linear(784, 10),
});
var model = try Model.compile(.cpu, allocator);
defer model.deinit();
Forward Pass
pub fn forward(self: *CompiledModel, input: Tensor) !Tensor
Example:
const input = try Tensor.randn(.{32, 784}, .f32, .cpu);
const output = try model.forward(input);
Loss Functions
Cross Entropy
pub fn crossEntropy(predictions: Tensor, targets: Tensor) !Tensor
Example:
const output = try model.forward(input);
const loss = try ztorch.crossEntropy(output, labels);
Mean Squared Error
pub fn mse(predictions: Tensor, targets: Tensor) !Tensor
Backward Pass
pub fn backward(self: *CompiledModel, loss: Tensor) !void
Computes gradients for all parameters with requires_grad = true.
Example:
const loss = try ztorch.crossEntropy(output, labels);
try model.backward(loss);
Optimization
Optimizer Step
pub fn step(self: *CompiledModel, config: OptimizerConfig) !void
pub const OptimizerConfig = union(enum) {
sgd: SGDConfig,
adam: AdamConfig,
};
SGD
pub const SGDConfig = struct {
lr: f32,
momentum: f32 = 0.0,
weight_decay: f32 = 0.0,
};
try model.step(.{ .sgd = .{ .lr = 0.01, .momentum = 0.9 } });
Adam
pub const AdamConfig = struct {
lr: f32,
beta1: f32 = 0.9,
beta2: f32 = 0.999,
eps: f32 = 1e-8,
weight_decay: f32 = 0.0,
};
try model.step(.{ .adam = .{ .lr = 0.001 } });
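For reference, the per-parameter update that an Adam step applies, written as a minimal scalar sketch. Here m and v are the running first- and second-moment buffers the optimizer keeps per parameter; this is the textbook rule, not the actual Ztorch implementation.
const std = @import("std");

fn adamStep(
    params: []f32,
    grads: []const f32,
    m: []f32, // first-moment estimate, same length as params
    v: []f32, // second-moment estimate, same length as params
    t: u32, // step count, starting at 1
    lr: f32,
    beta1: f32,
    beta2: f32,
    eps: f32,
) void {
    const b1t = std.math.pow(f32, beta1, @as(f32, @floatFromInt(t)));
    const b2t = std.math.pow(f32, beta2, @as(f32, @floatFromInt(t)));
    for (params, grads, m, v) |*p, g, *mi, *vi| {
        mi.* = beta1 * mi.* + (1 - beta1) * g;
        vi.* = beta2 * vi.* + (1 - beta2) * g * g;
        const m_hat = mi.* / (1 - b1t); // bias correction
        const v_hat = vi.* / (1 - b2t);
        p.* -= lr * m_hat / (@sqrt(v_hat) + eps);
    }
}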
Complete Training Example
const std = @import("std");
const ztorch = @import("ztorch");
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer _ = gpa.deinit();
const allocator = gpa.allocator();
// Define model
const Model = ztorch.Sequential(.{
ztorch.Linear(784, 128),
ztorch.ReLU(),
ztorch.Linear(128, 64),
ztorch.ReLU(),
ztorch.Linear(64, 10),
});
// Compile
var model = try Model.compile(.cpu, allocator);
defer model.deinit();
// Training loop
const epochs = 10;
const batch_size = 32;
for (0..epochs) |epoch| {
var total_loss: f32 = 0;
var num_batches: usize = 0;
// ... load batch ...
const input = try ztorch.Tensor.randn(.{batch_size, 784}, .f32, .cpu);
const labels = try ztorch.Tensor.randint(.{batch_size}, 10, .cpu);
// Forward
const output = try model.forward(input);
// Loss
const loss = try ztorch.crossEntropy(output, labels);
total_loss += loss.item();
num_batches += 1;
// Backward
try model.backward(loss);
// Update
try model.step(.{ .adam = .{ .lr = 0.001 } });
std.debug.print("Epoch {}: Loss = {d:.4}\n", .{
epoch,
total_loss / @as(f32, @floatFromInt(num_batches)),
});
}
}
Backend-Specific Features
CUDA
// Use CUDA backend
var model = try Model.compile(.cuda, allocator);
// CUDA-specific device selection (future)
// ztorch.cuda.setDevice(0);
Memory Management
// Models manage their own memory
var model = try Model.compile(.cpu, allocator);
defer model.deinit(); // Frees all tensors
// Explicit tensor lifetime
const t = try Tensor.zeros(.{100}, .f32, .cpu);
defer t.deinit();
Error Handling
All fallible operations return errors:
pub const Error = error{
OutOfMemory,
DeviceError,
ShapeMismatch,
InvalidDType,
InvalidDevice,
BackendNotSupported,
};
Example:
const output = model.forward(input) catch |err| {
std.debug.print("Forward pass failed: {}\n", .{err});
return err;
};
Ztorch Operations Catalog
Complete specification of all operations in Ztorch.
Each operation includes:
- Mathematical definition
- Forward pass algorithm
- Backward pass (gradient) algorithm
- Implementation status
- Test requirements
- Performance characteristics
Matrix Operations
MatMul
Status: ✅ Reference CPU
Matrix multiplication: C = A @ B
Shapes:
A: (M, K), B: (K, N), C: (M, N)
Forward:
C[i,j] = sum_k(A[i,k] * B[k,j])
Backward:
d_A[i,k] = sum_j(d_C[i,j] * B[k,j])
d_B[k,j] = sum_i(A[i,k] * d_C[i,j])
Simplified:
d_A = d_C @ B.T
d_B = A.T @ d_C
FLOPs: 2 * M * K * N
Memory: (M*K + K*N + M*N) * sizeof(dtype)
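A reference CPU-scalar backward that follows the simplified form above. This is an illustrative sketch; the real kernel lives behind the backend interface.
/// d_A = d_C @ B^T and d_B = A^T @ d_C for A(M,K), B(K,N), d_C(M,N).
/// All slices are row-major and pre-allocated; d_a and d_b are overwritten.
fn matmulBackwardScalar(
    m: usize,
    k: usize,
    n: usize,
    a: []const f32,
    b: []const f32,
    d_c: []const f32,
    d_a: []f32,
    d_b: []f32,
) void {
    @memset(d_a, 0);
    @memset(d_b, 0);
    for (0..m) |i| {
        for (0..n) |j| {
            const g = d_c[i * n + j];
            for (0..k) |kk| {
                d_a[i * k + kk] += g * b[kk * n + j]; // d_C @ B^T
                d_b[kk * n + j] += a[i * k + kk] * g; // A^T @ d_C
            }
        }
    }
}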
Tests Required:
- Identity matrix
- Known result (2x2, 3x3)
- Large matrices (1024x1024)
- Non-square matrices
- Gradient check
Backend Status:
- CPU Scalar: ✅ Implemented (matmul_cpu_scalar)
- CPU SIMD: ⏳
- CUDA: ⏳
Benchmarks: bench/ops/matmul.zig (included in zig build bench)
Transpose
Status: ✅ Reference CPU
Matrix transpose: B = A.T
Shapes:
A: (M, N), B: (N, M)
Forward:
B[i,j] = A[j,i]
Backward:
d_A = d_B.T
Implementation Note: Can be a view (no data copy) with stride adjustment.
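To make that concrete, a strided 2-D view can transpose by swapping dims and strides with no data movement. This is a minimal sketch with a local View2D struct, not the Ztorch Tensor type.
const std = @import("std");

// Minimal 2-D strided view, for illustration only.
const View2D = struct {
    data: []const f32,
    rows: usize,
    cols: usize,
    row_stride: usize,
    col_stride: usize,

    fn at(self: View2D, i: usize, j: usize) f32 {
        return self.data[i * self.row_stride + j * self.col_stride];
    }

    /// B = A.T as a view: swap the dims and the strides.
    fn transpose(self: View2D) View2D {
        return .{
            .data = self.data,
            .rows = self.cols,
            .cols = self.rows,
            .row_stride = self.col_stride,
            .col_stride = self.row_stride,
        };
    }
};

test "transpose view reads A[j,i]" {
    const data = [_]f32{ 1, 2, 3, 4, 5, 6 }; // 2x3, row-major
    const a = View2D{ .data = &data, .rows = 2, .cols = 3, .row_stride = 3, .col_stride = 1 };
    const b = a.transpose(); // 3x2 view over the same buffer
    try std.testing.expectEqual(@as(f32, 2), b.at(1, 0)); // B[1,0] == A[0,1]
}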
Activations
ReLU
Status: ✅ Reference CPU
Rectified Linear Unit: y = max(0, x)
Forward:
y[i] = max(0, x[i])
Backward:
d_x[i] = d_y[i] * (x[i] > 0 ? 1 : 0)
FLOPs: N (comparisons)
Tests Required:
- All positive input
- All negative input
- Mixed positive/negative
- Zero values
- Gradient check
Backend Status:
- CPU Scalar: ✅ Implemented (relu_cpu_scalar, relu_backward_cpu_scalar)
- CPU SIMD: ⏳
- CUDA: ⏳
Benchmarks: bench/ops/activations.zig (included in zig build bench)
PTX Implementation (reference):
// y = max(0, x)
max.f32 %out, %in, 0.0
GELU
Status: ⏳ Planned
Gaussian Error Linear Unit (approximation):
y = 0.5 * x * (1 + tanh(sqrt(2/ฯ) * (x + 0.044715 * x^3)))
Backward: (complex, see implementation)
FLOPs: ~10N (approximate)
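A scalar sketch of the tanh approximation above, forward only (not the planned Ztorch kernel):
const std = @import("std");

fn geluScalar(x: f32) f32 {
    const c: f32 = @sqrt(2.0 / std.math.pi); // sqrt(2/pi) ~ 0.7978845608
    const inner = c * (x + 0.044715 * x * x * x);
    return 0.5 * x * (1.0 + std.math.tanh(inner));
}

test "gelu approximation sanity" {
    try std.testing.expectApproxEqAbs(@as(f32, 0.0), geluScalar(0.0), 1e-6);
    // Large positive inputs pass through almost unchanged.
    try std.testing.expectApproxEqAbs(@as(f32, 5.0), geluScalar(5.0), 1e-3);
}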
Softmax
Status: ✅ Reference CPU
Softmax over dimension d:
y[i] = exp(x[i] - max(x)) / sum_j(exp(x[j] - max(x)))
Why subtract max: Numerical stability (prevent overflow)
Forward Algorithm:
1. max_val = max(x)
2. exp_vals[i] = exp(x[i] - max_val)
3. sum_exp = sum(exp_vals)
4. y[i] = exp_vals[i] / sum_exp
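The four steps above as a scalar sketch over a single row (illustrative; the shipped implementation is softmax_cpu_scalar):
const std = @import("std");

fn softmaxRow(x: []const f32, y: []f32) void {
    // 1. max for numerical stability
    var max_val = x[0];
    for (x[1..]) |v| max_val = @max(max_val, v);
    // 2-3. exponentiate and accumulate the sum
    var sum: f32 = 0;
    for (x, y) |v, *out| {
        out.* = @exp(v - max_val);
        sum += out.*;
    }
    // 4. normalize
    for (y) |*out| out.* /= sum;
}

test "softmax sums to 1 even for large logits" {
    const logits = [_]f32{ 1000.0, 1001.0, 999.0 }; // would overflow without the max trick
    var probs: [3]f32 = undefined;
    softmaxRow(&logits, &probs);
    var sum: f32 = 0;
    for (probs) |p| sum += p;
    try std.testing.expectApproxEqAbs(@as(f32, 1.0), sum, 1e-5);
}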
Backward: (complex Jacobian, see implementation)
FLOPs: ~5N
Tests Required:
- Uniform input (all same value)
- One large value (should be ~1.0)
- Output sums to 1.0
- Gradient check
Backend Status:
- CPU Scalar: ✅ Implemented (softmax_cpu_scalar)
- CPU SIMD: ⏳
- CUDA: ⏳
Normalization
LayerNorm
Status: ⏳ Planned
Layer normalization:
y = (x - mean) / sqrt(var + eps) * gamma + beta
Forward Algorithm:
1. mean = sum(x) / N
2. var = sum((x - mean)^2) / N
3. x_norm = (x - mean) / sqrt(var + eps)
4. y = gamma * x_norm + beta
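A scalar sketch of this forward algorithm over one feature vector, with per-element gamma and beta (illustrative only; the operation itself is still planned):
const std = @import("std");

fn layerNormRow(x: []const f32, gamma: []const f32, beta: []const f32, eps: f32, y: []f32) void {
    const n: f32 = @floatFromInt(x.len);
    // 1. mean
    var mean: f32 = 0;
    for (x) |v| mean += v;
    mean /= n;
    // 2. variance (biased, matching the definition above)
    var variance: f32 = 0;
    for (x) |v| variance += (v - mean) * (v - mean);
    variance /= n;
    // 3-4. normalize, then scale and shift
    const inv_std = 1.0 / @sqrt(variance + eps);
    for (x, gamma, beta, y) |v, g, b, *out| {
        out.* = g * (v - mean) * inv_std + b;
    }
}

test "layernorm gives ~zero mean with unit gamma, zero beta" {
    const x = [_]f32{ 1, 2, 3, 4 };
    const gamma = [_]f32{ 1, 1, 1, 1 };
    const beta = [_]f32{ 0, 0, 0, 0 };
    var y: [4]f32 = undefined;
    layerNormRow(&x, &gamma, &beta, 1e-5, &y);
    var mean: f32 = 0;
    for (y) |v| mean += v;
    try std.testing.expectApproxEqAbs(@as(f32, 0.0), mean / 4.0, 1e-5);
}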
Backward: (chain rule through all operations)
FLOPs: ~5N
Parameters:
- gamma: learned scale (shape: normalized_shape)
- beta: learned bias (shape: normalized_shape)
Tests Required:
- Known mean/variance
- Learnable parameters
- Gradient check
BatchNorm
Status: ⏳ Planned (v0.2)
Batch normalization (more complex due to training/inference modes)
Loss Functions
CrossEntropy
Status: ✅ Reference CPU
Cross-entropy loss over class logits, computed with numerical stabilization:
logsumexp = max(logits) + log(sum(exp(logits - max(logits))))
loss = mean(logsumexp - logits[range, target])
Backward:
d_logits = (softmax(logits) - one_hot(target)) / batch_size
Backend Status:
- CPU Scalar: ✅ Implemented (cross_entropy_cpu_scalar)
- CPU SIMD: ⏳
- CUDA: ⏳
Tests: tests/ops/loss.zig
Forward Algorithm:
1. Apply softmax to predictions
2. loss = -log(probs[target_class])
Backward:
d_pred[i] = probs[i] - (i == target ? 1 : 0)
Tests Required:
- Perfect prediction (loss ~ 0)
- Random prediction (loss ~ log(num_classes))
- Gradient check
MSE
Status: ⏳ Planned
Mean squared error:
loss = mean((pred - target)^2)
Forward:
loss = sum((pred[i] - target[i])^2) / N
Backward:
d_pred[i] = 2 * (pred[i] - target[i]) / N
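Forward and backward as a scalar sketch (the op is still planned; this is just the math above written in Zig):
const std = @import("std");

fn mseForward(pred: []const f32, target: []const f32) f32 {
    var sum: f32 = 0;
    for (pred, target) |p, t| sum += (p - t) * (p - t);
    return sum / @as(f32, @floatFromInt(pred.len));
}

fn mseBackward(pred: []const f32, target: []const f32, d_pred: []f32) void {
    const n: f32 = @floatFromInt(pred.len);
    for (pred, target, d_pred) |p, t, *d| d.* = 2.0 * (p - t) / n;
}

test "mse of identical tensors is zero" {
    const a = [_]f32{ 1, 2, 3 };
    try std.testing.expectApproxEqAbs(@as(f32, 0.0), mseForward(&a, &a), 1e-7);
}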
Element-wise Operations
Add
z = x + y
Shapes: Broadcasting supported
Forward: z[i] = x[i] + y[i]
Backward: d_x = d_z, d_y = d_z
Multiply
z = x * y
Forward: z[i] = x[i] * y[i]
Backward: d_x = d_z * y, d_y = d_z * x
Exp
y = exp(x)
Forward: y[i] = exp(x[i])
Backward: d_x[i] = d_y[i] * y[i]
Log
y = log(x)
Forward: y[i] = log(x[i])
Backward: d_x[i] = d_y[i] / x[i]
Reduction Operations
Sum
Sum over dimension(s)
Forward: y = sum(x, dim)
Backward: Broadcast d_y to shape of x
Mean
Mean over dimension(s)
Forward: y = sum(x, dim) / N
Backward: Broadcast d_y / N to shape of x
Max
Max over dimension(s)
Forward: y = max(x, dim)
Backward: Gradient flows only to max element (argmax mask)
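The argmax-mask rule in sketch form, reducing a vector to a single scalar max (hypothetical helper; ties go to the first maximal element):
/// y = max(x); d_x[i] = d_y if i == argmax(x), else 0.
fn maxBackward(x: []const f32, d_y: f32, d_x: []f32) void {
    var argmax: usize = 0;
    for (x, 0..) |v, i| {
        if (v > x[argmax]) argmax = i;
    }
    for (d_x, 0..) |*d, i| {
        d.* = if (i == argmax) d_y else 0;
    }
}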
Testing Requirements
Every operation must have:
- Unit tests with known values
test "matmul: 2x2 known result" {
// [[1,2],[3,4]] @ [[5,6],[7,8]] = [[19,22],[43,50]]
}
- Gradient checks
test "matmul: gradient check" {
// Compare autograd gradient vs numerical gradient
}
- Backend parity tests
test "matmul: cpu vs cuda" {
// Verify GPU output matches CPU (within epsilon)
}
- Benchmarks
bench "matmul: 1024x1024" {
// Measure GFLOPS
}
Optimizers
SGD
Status: ✅ Reference CPU
Stochastic Gradient Descent with constant learning rate.
Update Rule:
param -= lr * grad
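The update rule for a flat parameter slice, as a minimal sketch without momentum or weight decay (the actual code lives in optim/sgd.zig):
fn sgdStep(params: []f32, grads: []const f32, lr: f32) void {
    for (params, grads) |*p, g| {
        p.* -= lr * g; // param -= lr * grad
    }
}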
Backend Status:
- CPU Scalar: ✅ (optim/sgd.zig)
- CPU SIMD: ⏳
- CUDA: ⏳
Tests: tests/integration/xor.zig
Implementation Checklist
For each operation:
- Mathematical definition documented
- Forward pass implemented (CPU scalar)
- Backward pass implemented (CPU scalar)
- Unit tests with known values
- Gradient check tests
- Benchmarked (baseline)
- CPU SIMD implementation
- CUDA implementation
- Backend parity tests
- Performance validated (vs napkin math)
Future Operations (v0.2+)
- Conv2D
- MaxPool2D
- Dropout
- Embedding
- Attention (fused)
- RMSNorm
Ztorch Backends
Backend implementation details and architecture.
Overview
Ztorch supports multiple backends, each optimized for different hardware:
- CPU Scalar - Reference implementation, obviously correct
- CPU SIMD - Vectorized with AVX2/AVX512 (x86) or NEON (ARM)
- CUDA - NVIDIA GPUs via PTX assembly
- ROCm - AMD GPUs via LLVM IR
- Vulkan - Cross-vendor via SPIR-V
Backend Interface
All backends implement the same interface:
pub const Backend = struct {
vtable: *const VTable,
context: *anyopaque,
pub const VTable = struct {
// Lifecycle
init: *const fn (Allocator) anyerror!*anyopaque,
deinit: *const fn (*anyopaque) void,
// Memory
alloc: *const fn (*anyopaque, usize) anyerror!DevicePtr,
free: *const fn (*anyopaque, DevicePtr) void,
copy_to_device: *const fn (*anyopaque, []const u8, DevicePtr) anyerror!void,
copy_from_device: *const fn (*anyopaque, DevicePtr, []u8) anyerror!void,
// Operations
matmul: *const fn (*anyopaque, Tensor, Tensor) anyerror!Tensor,
relu: *const fn (*anyopaque, Tensor) anyerror!Tensor,
softmax: *const fn (*anyopaque, Tensor, usize) anyerror!Tensor,
// ... all ops
};
};
CPU Scalar Backend
Purpose: Reference implementation for correctness.
Characteristics:
- Simple, readable code
- No SIMD, no assembly
- Serves as ground truth for testing
Example: MatMul
fn matmul_cpu_scalar(a: Tensor, b: Tensor, c: *Tensor) void {
const M = a.shape.dims[0];
const K = a.shape.dims[1];
const N = b.shape.dims[1];
for (0..M) |i| {
for (0..N) |j| {
var sum: f32 = 0;
for (0..K) |k| {
sum += a.data[i * K + k] * b.data[k * N + j];
}
c.data[i * N + j] = sum;
}
}
}
Expected Performance:
- MatMul 1024x1024: ~5 GFLOPS (single core)
CPU SIMD Backend
Purpose: Optimized CPU implementation using vector instructions.
Implementation:
- x86: AVX2 (256-bit) or AVX-512 (512-bit) intrinsics
- ARM: NEON (128-bit) intrinsics
- Runtime detection of CPU capabilities
- Fallback to scalar if unsupported
Example: ReLU with AVX2
fn relu_cpu_avx2(input: []f32, output: []f32) void {
const zero = @Vector(8, f32){0, 0, 0, 0, 0, 0, 0, 0};
var i: usize = 0;
while (i + 8 <= input.len) : (i += 8) {
const vec: @Vector(8, f32) = input[i..][0..8].*;
const result = @max(vec, zero);
output[i..][0..8].* = result;
}
// Handle remaining elements
while (i < input.len) : (i += 1) {
output[i] = @max(input[i], 0);
}
}
Expected Performance:
- 4-8x speedup over scalar (depending on operation)
- MatMul 1024x1024: ~20-40 GFLOPS (single core)
CUDA Backend
Purpose: High-performance execution on NVIDIA GPUs.
Architecture:
- Generate PTX assembly at comptime
- Load via CUDA Driver API
- Use tensor cores when available
Code Generation:
fn generateMatMulPTX(
comptime M: usize,
comptime N: usize,
comptime K: usize,
) [:0]const u8 {
comptime {
const tile_size = 32;
var ptx: []const u8 = ".version 8.5\n";
ptx = ptx ++ ".target sm_80\n"; // Ampere
ptx = ptx ++ ".address_size 64\n\n";
// Generate tiled matmul kernel
// Use wmma.mma for tensor cores
// ...
return ptx;
}
}
Optimization Techniques:
- Tiling for shared memory
- Tensor core usage (wmma instructions)
- Memory coalescing
- Bank conflict avoidance
Expected Performance:
- RTX 4090: 4000+ GFLOPS for MatMul
- Limited by memory bandwidth for smaller ops
Tensor Cores:
// Load into matrix fragments
wmma.load.sync.aligned.m16n16k16.global.f32 %frag_a, [%ptr_a];
wmma.load.sync.aligned.m16n16k16.global.f32 %frag_b, [%ptr_b];
// Multiply-accumulate
wmma.mma.sync.aligned.m16n16k16.f32.f32 %frag_c, %frag_a, %frag_b, %frag_c;
// Store result
wmma.store.sync.aligned.m16n16k16.global.f32 [%ptr_c], %frag_c;
ROCm Backend
Purpose: Support for AMD GPUs.
Architecture:
- Generate LLVM IR at comptime
- Compile via ROCm HIP toolchain
- Similar optimization strategies to CUDA
Code Generation:
fn generateMatMulLLVM(
comptime M: usize,
comptime N: usize,
comptime K: usize,
) [:0]const u8 {
comptime {
var llvm: []const u8 = "";
// LLVM IR for tiled matmul
llvm = llvm ++ "define void @matmul(...) {\n";
// ...
llvm = llvm ++ "}\n";
return llvm;
}
}
Expected Performance:
- Similar to CUDA (depends on GPU model)
- MI300X: 5000+ GFLOPS for MatMul
Vulkan Backend
Purpose: Cross-vendor GPU support (NVIDIA, AMD, Intel).
Architecture:
- Generate SPIR-V at comptime
- Use Vulkan compute shaders
- Portable across all GPUs
Trade-offs:
- More portable but less optimized than native backends
- No tensor core access (uses FP32 MAD operations)
- Good for inference, adequate for training
Code Generation:
fn generateMatMulSPIRV(
comptime M: usize,
comptime N: usize,
comptime K: usize,
) []const u8 {
comptime {
// SPIR-V assembly
var spirv: []const u8 = "";
// OpTypeFloat, OpTypeVector, etc.
// OpMatrixTimesVector or manual loop
// ...
return spirv;
}
}
Expected Performance:
- 50-80% of native backend performance
- Good enough for most inference workloads
Backend Selection
Compile-time
const Model = ztorch.Sequential(.{ /* ... */ });
var model = try Model.compile(.cuda, allocator); // Fixed at compile time
Runtime
const backend: Device = if (cuda_available) .cuda else .cpu;
var model = try Model.compile(backend, allocator);
Auto-detection
const backend = try ztorch.selectBestBackend(); // Detects available hardware
var model = try Model.compile(backend, allocator);
Testing Backend Parity
All backends must produce identical results (within floating-point precision).
test "matmul: cpu vs cuda parity" {
const input_a = try Tensor.randn(.{32, 64}, .f32, .cpu);
const input_b = try Tensor.randn(.{64, 128}, .f32, .cpu);
// CPU result
const output_cpu = try ops.matmul_cpu(input_a, input_b);
// CUDA result
const input_a_gpu = try input_a.to(.cuda);
const input_b_gpu = try input_b.to(.cuda);
const output_cuda = try ops.matmul_cuda(input_a_gpu, input_b_gpu);
const output_cuda_cpu = try output_cuda.to(.cpu);
// Compare
for (output_cpu.data, output_cuda_cpu.data) |cpu_val, cuda_val| {
try testing.expectApproxEqAbs(cpu_val, cuda_val, 1e-4); // epsilon
}
}
Performance Validation
Every backend implementation must meet minimum performance requirements.
Napkin Math Target:
- CPU Scalar: Baseline
- CPU SIMD: >2x scalar
- CUDA: >10x CPU for N>1024
- ROCm: Similar to CUDA
- Vulkan: >5x CPU
Example Validation:
bench "matmul: 1024x1024 performance check" {
const result = try benchMatMul(.cuda, 1024, 1024, 1024);
// RTX 4090 theoretical: 82 TFLOPS
// 2*1024^3 FLOPs = 2.1B FLOPs
// Minimum 50% efficiency: 41 TFLOPS = 51 µs
try testing.expect(result.elapsed_ns < 51_000);
}
Future Backends
- Metal (Apple Silicon) - v0.3
- WebGPU (browser) - v0.4
- CPU Multi-threaded - v0.2 (OpenMP-style)
Ztorch Testing Strategy
Comprehensive testing approach following Tiger Style principles.
Testing Philosophy
- TDD from day 0 - Write tests before implementation
- Test all platforms - Linux, macOS, Windows on x86_64 and aarch64
- Reference-based verification - CPU scalar is ground truth
- No untested code - Every line exercised by tests
- Fail fast - Assertions in code, strict validation in tests
Test Categories
1. Unit Tests
Test individual operations with known inputs/outputs.
Location: test/ops/
Example:
// test/ops/matmul_test.zig
const std = @import("std");
const ztorch = @import("ztorch");
const testing = std.testing;
test "matmul: 2x2 identity matrix" {
// I @ A = A
const identity = [_]f32{ 1, 0, 0, 1 };
const matrix = [_]f32{ 5, 6, 7, 8 };
var result: [4]f32 = undefined;
ztorch.ops.matmul_cpu(.{2, 2}, &identity, &matrix, &result);
try testing.expectEqual(@as(f32, 5), result[0]);
try testing.expectEqual(@as(f32, 6), result[1]);
try testing.expectEqual(@as(f32, 7), result[2]);
try testing.expectEqual(@as(f32, 8), result[3]);
}
test "matmul: 2x2 known result" {
// [[1, 2], [3, 4]] @ [[5, 6], [7, 8]]
// = [[1*5+2*7, 1*6+2*8], [3*5+4*7, 3*6+4*8]]
// = [[19, 22], [43, 50]]
const a = [_]f32{ 1, 2, 3, 4 };
const b = [_]f32{ 5, 6, 7, 8 };
var c: [4]f32 = undefined;
ztorch.ops.matmul_cpu(.{2, 2}, &a, &b, &c);
const epsilon = 1e-5;
try testing.expectApproxEqAbs(@as(f32, 19), c[0], epsilon);
try testing.expectApproxEqAbs(@as(f32, 22), c[1], epsilon);
try testing.expectApproxEqAbs(@as(f32, 43), c[2], epsilon);
try testing.expectApproxEqAbs(@as(f32, 50), c[3], epsilon);
}
test "matmul: non-square matrices" {
// (2, 3) @ (3, 4) = (2, 4)
const a = [_]f32{ 1, 2, 3, 4, 5, 6 };
const b = [_]f32{ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 };
var c: [8]f32 = undefined;
ztorch.ops.matmul_cpu(.{2, 3, 4}, &a, &b, &c);
// Verify all elements (computed manually)
const expected = [_]f32{ 38, 44, 50, 56, 83, 98, 113, 128 };
const epsilon = 1e-5;
for (expected, c) |exp, act| {
try testing.expectApproxEqAbs(exp, act, epsilon);
}
}
test "matmul: large matrix stress test" {
const allocator = testing.allocator;
const n = 1024;
var a = try allocator.alloc(f32, n * n);
defer allocator.free(a);
var b = try allocator.alloc(f32, n * n);
defer allocator.free(b);
var c = try allocator.alloc(f32, n * n);
defer allocator.free(c);
// Fill with known pattern
for (0..n*n) |i| {
a[i] = @floatFromInt(i % 100);
b[i] = @floatFromInt((i * 7) % 100);
}
// Should complete without error
ztorch.ops.matmul_cpu(.{n, n}, a, b, c);
// Spot check a few values (not full verification)
try testing.expect(c[0] != 0);
try testing.expect(!std.math.isNan(c[n*n - 1]));
}
Requirements:
- At least 3 tests per operation (simple, known result, edge cases)
- Cover edge cases (zeros, negatives, large values, etc.)
- Test various sizes (small, medium, large)
2. Gradient Check Tests
Verify autograd correctness using numerical differentiation.
Location: test/autograd/
Example:
// test/autograd/gradient_check_test.zig
const std = @import("std");
const ztorch = @import("ztorch");
const testing = std.testing;
fn numericalGradient(
comptime f: anytype,
x: []f32,
epsilon: f32,
) ![]f32 {
const allocator = testing.allocator;
var grad = try allocator.alloc(f32, x.len);
for (0..x.len) |i| {
// f(x + h)
x[i] += epsilon;
const f_plus = try f(x);
// f(x - h)
x[i] -= 2 * epsilon;
const f_minus = try f(x);
// (f(x+h) - f(x-h)) / 2h
grad[i] = (f_plus - f_minus) / (2 * epsilon);
// Restore
x[i] += epsilon;
}
return grad;
}
test "matmul: gradient check" {
const allocator = testing.allocator;
// Small matrices for numerical stability
const a = [_]f32{ 1, 2, 3, 4 };
const b = [_]f32{ 5, 6, 7, 8 };
// Forward pass
var c: [4]f32 = undefined;
ztorch.ops.matmul_cpu(.{2, 2}, &a, &b, &c);
// Backward pass (autograd)
const d_c = [_]f32{ 1, 1, 1, 1 }; // Gradient of loss
var d_a: [4]f32 = undefined;
var d_b: [4]f32 = undefined;
ztorch.ops.matmul_backward_cpu(.{2, 2}, &d_c, &a, &b, &d_a, &d_b);
// Numerical gradient
var a_copy = a;
const num_grad_a = try numericalGradient(
struct {
fn f(x: []f32) !f32 {
var tmp: [4]f32 = undefined;
ztorch.ops.matmul_cpu(.{2, 2}, x, &b, &tmp);
return tmp[0] + tmp[1] + tmp[2] + tmp[3]; // sum
}
}.f,
&a_copy,
1e-4,
);
defer allocator.free(num_grad_a);
// Compare autograd vs numerical
const epsilon = 1e-3; // Numerical gradients are approximate
for (d_a, num_grad_a) |auto_grad, num_grad| {
try testing.expectApproxEqAbs(auto_grad, num_grad, epsilon);
}
}
test "relu: gradient check" {
const input = [_]f32{ -2, -1, 0, 1, 2 };
var output: [5]f32 = undefined;
// Forward
ztorch.ops.relu_cpu(&input, &output);
// Backward
const d_output = [_]f32{ 1, 1, 1, 1, 1 };
var d_input: [5]f32 = undefined;
ztorch.ops.relu_backward_cpu(&d_output, &input, &d_input);
// Expected gradients: ReLU'(x) = x > 0 ? 1 : 0
const expected = [_]f32{ 0, 0, 0, 1, 1 };
for (expected, d_input) |exp, act| {
try testing.expectEqual(exp, act);
}
}
Requirements:
- Every differentiable operation must have gradient check
- Use numerical differentiation as reference
- Epsilon tolerance based on operation complexity
3. Backend Parity Tests
Verify all backends produce identical results.
Location: test/backends/
Example:
// test/backends/parity_test.zig
const std = @import("std");
const ztorch = @import("ztorch");
const testing = std.testing;
test "matmul: cpu scalar vs cpu simd" {
if (!ztorch.cpu.hasSimd()) return error.SkipZigTest;
const allocator = testing.allocator;
const n = 256;
// Random input
var a = try allocator.alloc(f32, n * n);
defer allocator.free(a);
var b = try allocator.alloc(f32, n * n);
defer allocator.free(b);
var rng = std.Random.DefaultPrng.init(42);
for (0..n*n) |i| {
a[i] = rng.random().float(f32) * 2 - 1; // [-1, 1]
b[i] = rng.random().float(f32) * 2 - 1;
}
// CPU scalar
var c_scalar = try allocator.alloc(f32, n * n);
defer allocator.free(c_scalar);
ztorch.ops.matmul_cpu_scalar(.{n, n}, a, b, c_scalar);
// CPU SIMD
var c_simd = try allocator.alloc(f32, n * n);
defer allocator.free(c_simd);
ztorch.ops.matmul_cpu_simd(.{n, n}, a, b, c_simd);
// Compare
const epsilon = 1e-4; // Allow small numerical differences
for (c_scalar, c_simd) |scalar, simd| {
try testing.expectApproxEqAbs(scalar, simd, epsilon);
}
}
test "matmul: cpu vs cuda" {
if (!ztorch.cuda.isAvailable()) return error.SkipZigTest;
const allocator = testing.allocator;
const n = 1024;
// Random input
var a = try allocator.alloc(f32, n * n);
defer allocator.free(a);
var b = try allocator.alloc(f32, n * n);
defer allocator.free(b);
var rng = std.Random.DefaultPrng.init(42);
for (0..n*n) |i| {
a[i] = rng.random().float(f32) * 2 - 1;
b[i] = rng.random().float(f32) * 2 - 1;
}
// CPU result
var c_cpu = try allocator.alloc(f32, n * n);
defer allocator.free(c_cpu);
ztorch.ops.matmul_cpu(.{n, n}, a, b, c_cpu);
// CUDA result
const a_gpu = try ztorch.cuda.allocAndCopy(a);
defer ztorch.cuda.free(a_gpu);
const b_gpu = try ztorch.cuda.allocAndCopy(b);
defer ztorch.cuda.free(b_gpu);
const c_gpu = try ztorch.cuda.alloc(n * n * @sizeOf(f32));
defer ztorch.cuda.free(c_gpu);
try ztorch.ops.matmul_cuda(.{n, n}, a_gpu, b_gpu, c_gpu);
var c_cuda = try allocator.alloc(f32, n * n);
defer allocator.free(c_cuda);
try ztorch.cuda.copyToHost(c_gpu, c_cuda);
// Compare
const epsilon = 1e-3; // GPU may have slightly different rounding
var max_diff: f32 = 0;
for (c_cpu, c_cuda) |cpu, cuda| {
const diff = @abs(cpu - cuda);
max_diff = @max(max_diff, diff);
try testing.expectApproxEqAbs(cpu, cuda, epsilon);
}
std.debug.print("Max difference: {d:.6}\n", .{max_diff});
}
Requirements:
- Test all backend combinations
- Use random inputs to catch edge cases
- Report maximum difference for debugging
4. Integration Tests
Test complete workflows (model training, inference).
Location: test/integration/
Example:
// test/integration/mnist_test.zig
const std = @import("std");
const ztorch = @import("ztorch");
const testing = std.testing;
test "integration: train simple MLP on synthetic data" {
const allocator = testing.allocator;
// Define model
const Model = ztorch.Sequential(.{
ztorch.Linear(10, 20),
ztorch.ReLU(),
ztorch.Linear(20, 2),
});
var model = try Model.compile(.cpu, allocator);
defer model.deinit();
// Synthetic data (linearly separable)
const batch_size = 32;
var input = try ztorch.Tensor.zeros(.{batch_size, 10}, .f32, .cpu);
defer input.deinit();
var labels = try ztorch.Tensor.zeros(.{batch_size}, .i32, .cpu);
defer labels.deinit();
// Fill with pattern
for (0..batch_size) |i| {
const label: i32 = if (i < batch_size / 2) 0 else 1;
labels.data[i] = label;
for (0..10) |j| {
input.data[i * 10 + j] = if (label == 0) -1.0 else 1.0;
}
}
// Train for a few steps
var initial_loss: f32 = 0;
var final_loss: f32 = 0;
for (0..100) |step| {
const output = try model.forward(input);
defer output.deinit();
const loss = try ztorch.crossEntropy(output, labels);
defer loss.deinit();
if (step == 0) initial_loss = loss.item();
if (step == 99) final_loss = loss.item();
try model.backward(loss);
try model.step(.{ .sgd = .{ .lr = 0.01 } });
}
// Loss should decrease
try testing.expect(final_loss < initial_loss);
// Should converge to low loss on this simple problem
try testing.expect(final_loss < 0.1);
}
test "integration: save and load model" {
// TODO: Implement serialization
return error.SkipZigTest;
}
Requirements:
- Test complete training loops
- Verify loss decreases
- Test inference
- Test model serialization (future)
5. Property-Based Tests
Test properties that should hold for all inputs.
Example:
test "property: matmul associativity" {
// (A @ B) @ C = A @ (B @ C)
const allocator = testing.allocator;
var rng = std.Random.DefaultPrng.init(42);
// Small matrices for speed
const n = 16;
var a = try allocator.alloc(f32, n * n);
defer allocator.free(a);
var b = try allocator.alloc(f32, n * n);
defer allocator.free(b);
var c = try allocator.alloc(f32, n * n);
defer allocator.free(c);
// Random values
for (0..n*n) |i| {
a[i] = rng.random().float(f32);
b[i] = rng.random().float(f32);
c[i] = rng.random().float(f32);
}
// (A @ B) @ C
var ab = try allocator.alloc(f32, n * n);
defer allocator.free(ab);
ztorch.ops.matmul_cpu(.{n, n}, a, b, ab);
var abc_left = try allocator.alloc(f32, n * n);
defer allocator.free(abc_left);
ztorch.ops.matmul_cpu(.{n, n}, ab, c, abc_left);
// A @ (B @ C)
var bc = try allocator.alloc(f32, n * n);
defer allocator.free(bc);
ztorch.ops.matmul_cpu(.{n, n}, b, c, bc);
var abc_right = try allocator.alloc(f32, n * n);
defer allocator.free(abc_right);
ztorch.ops.matmul_cpu(.{n, n}, a, bc, abc_right);
// Should be approximately equal
const epsilon = 1e-3; // Numerical error accumulates
for (abc_left, abc_right) |left, right| {
try testing.expectApproxEqAbs(left, right, epsilon);
}
}
test "property: softmax sums to 1" {
const allocator = testing.allocator;
var rng = std.Random.DefaultPrng.init(42);
const n = 100;
var input = try allocator.alloc(f32, n);
defer allocator.free(input);
var output = try allocator.alloc(f32, n);
defer allocator.free(output);
for (0..10) |_| {
// Random input
for (0..n) |i| {
input[i] = rng.random().float(f32) * 20 - 10; // [-10, 10]
}
ztorch.ops.softmax_cpu(input, output);
// Sum should be 1
var sum: f32 = 0;
for (output) |val| sum += val;
try testing.expectApproxEqAbs(@as(f32, 1.0), sum, 1e-5);
}
}
CI Configuration
.github/workflows/ci.yml:
name: CI
on:
push:
branches: [main, dev]
pull_request:
branches: [main]
jobs:
test:
strategy:
fail-fast: false
matrix:
os: [ubuntu-latest, macos-latest, windows-latest]
arch: [x86_64, aarch64]
exclude:
- os: windows-latest
arch: aarch64
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v4
- name: Setup Zig
uses: goto-bus-stop/setup-zig@v2
with:
version: master
- name: Check formatting
run: zig fmt --check .
- name: Build
run: zig build --summary all
- name: Run unit tests
run: zig build test --summary all
- name: Run integration tests
run: zig build test-integration --summary all
- name: Run benchmarks (smoke test)
run: zig build bench --summary all
env:
BENCH_ITERATIONS: 10 # Quick smoke test
test-cuda:
runs-on: ubuntu-latest
# Requires self-hosted runner with GPU
# if: github.event_name == 'push'
steps:
- uses: actions/checkout@v4
- name: Setup Zig
uses: goto-bus-stop/setup-zig@v2
with:
version: master
- name: Setup CUDA
uses: Jimver/cuda-toolkit@v0.2.11
with:
cuda: "12.1.0"
- name: Run CUDA tests
run: zig build test-cuda --summary all
- name: Run backend parity tests
run: zig build test-parity --summary all
coverage:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Zig
uses: goto-bus-stop/setup-zig@v2
with:
version: master
- name: Run tests with coverage
run: zig build test -Dcoverage
- name: Upload coverage
uses: codecov/codecov-action@v3
with:
files: ./zig-out/coverage.txt
Test Execution
# Run all tests
zig build test
# Run specific test file
zig build test -- test/ops/matmul_test.zig
# Run with verbose output
zig build test --summary all
# Run benchmarks
zig build bench
# Run only fast tests (no GPU, no integration)
zig build test-fast
# Run GPU tests only
zig build test-cuda
# Run backend parity tests
zig build test-parity
Test Coverage Requirements
- Minimum: 90% line coverage
- Target: 95%+ line coverage
- Every public API must be tested
- Every backend implementation must be tested
Continuous Benchmarking
Track performance over time to catch regressions.
# Run benchmarks and save results
zig build bench --output bench-results.json
# Compare against baseline
zig build bench-compare --baseline main
Example output:
=== Benchmark Comparison ===
MatMul 1024x1024:
main: 12.3 µs (baseline)
current: 11.8 µs (4.1% faster ✅)
ReLU 1M elements:
main: 0.5 ms (baseline)
current: 0.6 ms (20% slower ❌) <-- REGRESSION!
Test Writing Guidelines
- Name tests clearly
test "matmul: 2x2 identity matrix" // ✅ Good
test "test1" // ❌ Bad
- One concept per test
  - Test identity matrix separately from known result
  - Makes failures easier to diagnose
- Use descriptive assertions
try testing.expectEqual(@as(f32, 19), result[0]); // ✅ Shows expected
try testing.expect(result[0] == 19); // ❌ Less clear
- Clean up resources
var tensor = try Tensor.zeros(.{100}, .f32, .cpu);
defer tensor.deinit(); // Always cleanup
- Document complex setups
// Testing matmul with non-square matrices
// A: (2, 3), B: (3, 4), Expected C: (2, 4)
Performance Testing
See benchmarking.md for full details on performance testing.
Ztorch Benchmarking
Performance measurement and validation strategy.
Philosophy
- Napkin math first - Estimate before measuring
- Prove every gain - Optimizations must show measurable improvement
- Track regressions - Continuous benchmarking catches slowdowns
- Document results - Publish benchmarks for transparency
Napkin Math
Before implementing any operation, estimate its performance.
Example: MatMul
Operation: C = A @ B
Shapes: A(M, K), B(K, N), C(M, N)
Example: (1024, 1024) @ (1024, 1024)
FLOPs:
2 * M * K * N = 2 * 1024^3 = 2,147,483,648 FLOPs ≈ 2.1 GFLOPs
Memory:
Read: M*K + K*N = 1024*1024 + 1024*1024 = 2*1024^2 = 2,097,152 elements
Write: M*N = 1024*1024 = 1,048,576 elements
Total: 3 * 1024^2 * 4 bytes = 12 MB
GPU: RTX 4090
Peak compute: 82 TFLOPS (FP32)
Peak bandwidth: 1 TB/s
Compute bound if: FLOPs/byte > Peak FLOPs / Peak bandwidth
2.1 GFLOPs / 12 MB = 175 FLOPs/byte
82 TFLOPS / 1 TB/s = 82 FLOPs/byte
175 > 82, so compute bound ✓
Expected time (compute bound):
2.1 GFLOPs / 82 TFLOPS = 25.6 µs
Expected time (memory bound):
12 MB / 1 TB/s = 12 µs
Realistically: ~25-50 µs (accounting for overhead)
CPU: Single core, 5 GHz
Peak (optimistic): ~100 GFLOPS (AVX2)
Realistic: ~20 GFLOPS
Expected time:
2.1 GFLOPs / 20 GFLOPS = 105 ms
The estimate gives us:
- Performance targets
- Expected speedup ratios
- Sanity checks
Benchmark Framework
Implementation
// bench/framework.zig
const std = @import("std");
const ztorch = @import("ztorch");
const duration = ztorch.util.duration;
pub const BenchResult = struct {
name: []const u8,
iterations: usize,
total_ns: u64,
mean_ns: u64,
median_ns: u64,
min_ns: u64,
max_ns: u64,
p99_ns: u64,
pub fn print(self: BenchResult) !void {
std.debug.print("=== Benchmark: {s} ===\n", .{self.name});
std.debug.print("Iterations: {}\n", .{self.iterations});
try printStat("Mean", self.mean_ns);
try printStat("Median", self.median_ns);
try printStat("Min", self.min_ns);
try printStat("Max", self.max_ns);
try printStat("P99", self.p99_ns);
}
};
pub fn benchmark(
comptime name: []const u8,
comptime iterations: usize,
func: anytype,
args: anytype,
) !BenchResult {
var times = try std.heap.page_allocator.alloc(u64, iterations);
defer std.heap.page_allocator.free(times);
const warmup_iterations: usize = if (iterations < 10) iterations else 10;
for (0..warmup_iterations) |_| {
try @call(.auto, func, args);
}
// Measure
for (0..iterations) |i| {
var timer = try std.time.Timer.start();
try @call(.auto, func, args);
times[i] = timer.read();
}
// Sort for median and percentiles
std.sort.pdq(u64, times, {}, comptime std.sort.asc(u64));
// Compute statistics
var total: u64 = 0;
for (times) |t| total += t;
return BenchResult{
.name = name,
.iterations = iterations,
.total_ns = total,
.mean_ns = total / iterations,
.median_ns = times[iterations / 2],
.min_ns = times[0],
.max_ns = times[iterations - 1],
.p99_ns = times[(iterations * 99) / 100],
};
}
fn printStat(label: []const u8, ns: u64) !void {
const text = try duration.formatDuration(std.heap.page_allocator, ns);
defer std.heap.page_allocator.free(text);
std.debug.print("{s}: {s}\n", .{ label, text });
}
The duration formatter is shared with the custom test runner, ensuring benchmarks and tests report timings in a consistent style.
Usage
const std = @import("std");
const framework = @import("../framework.zig");
fn step(allocator: std.mem.Allocator) !void {
// code-under-test goes here
_ = allocator;
}
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer _ = gpa.deinit();
const result = try framework.benchmark(
"demo.step",
100,
step,
.{ gpa.allocator() },
);
try result.print();
}
Operation Benchmarks
MatMul Benchmarks
$ zig build bench
...
=== Benchmark: matmul.cpu_scalar ===
Iterations: 10
Mean: 52.52 ms
Median: 52.49 ms
Min: 52.27 ms
Max: 53.38 ms
P99: 53.38 ms
Size: 256x256, GFLOPS: 0.64
Activation Benchmarks
$ zig build bench
...
=== Benchmark: relu.cpu_scalar.forward/1M_f32 ===
Iterations: 200
Mean: 1.78 ms
Median: 1.79 ms
Min: 1.75 ms
Max: 1.84 ms
P99: 1.83 ms
Elements: 1000000, GFLOPS: 0.56
Comparison Benchmarks
Compare against established libraries.
// bench/comparison/vs_numpy.zig
pub fn benchmarkVsNumPy() !void {
// Generate test data
// Run NumPy matmul (via Python subprocess)
// Run Ztorch matmul
// Compare times
std.debug.print("NumPy: {d:.2} ms\n", .{numpy_time});
std.debug.print("Ztorch: {d:.2} ms\n", .{ztorch_time});
std.debug.print("Speedup: {d:.2}x\n", .{numpy_time / ztorch_time});
}
Continuous Benchmarking
Regression Detection
Run benchmarks on every commit and compare against baseline.
# Save baseline
zig build bench --save-baseline main.json
# Compare current against baseline
zig build bench --compare main.json
# Output
=== Regression Report ===
MatMul 1024x1024:
Baseline: 12.5 ms
Current: 15.3 ms
Change: +22.4% ❌ REGRESSION
ReLU 1M elements:
Baseline: 2.5 ms
Current: 2.4 ms
Change: -4.0% ✅ Improvement
CI Integration
# .github/workflows/bench.yml
name: Benchmark
on:
pull_request:
branches: [main]
jobs:
benchmark:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0 # Need history for comparison
- name: Setup Zig
uses: goto-bus-stop/setup-zig@v2
- name: Download baseline
run: |
gh run download --name benchmark-baseline --repo ${{ github.repository }}
env:
GH_TOKEN: ${{ github.token }}
- name: Run benchmarks
run: zig build bench --compare baseline.json --output results.json
- name: Comment PR
uses: actions/github-script@v6
with:
script: |
const fs = require('fs');
const results = JSON.parse(fs.readFileSync('results.json'));
let comment = '## Benchmark Results\n\n';
for (const result of results) {
const change = ((result.current - result.baseline) / result.baseline * 100).toFixed(1);
const emoji = change > 5 ? '⚠️' : change < -5 ? '✅' : '✔️';
comment += `${emoji} ${result.name}: ${change}%\n`;
}
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: comment
});
Performance Targets
Minimum Requirements (v0.1)
| Operation | Size | CPU Scalar | CPU SIMD | CUDA (RTX 4090) |
|---|---|---|---|---|
| MatMul | 1024ยฒ | 5 GFLOPS | >10 GFLOPS | >1000 GFLOPS |
| ReLU | 1M | 0.4 GFLOPS | >2 GFLOPS | >10 GFLOPS |
| Softmax | 1M | 0.07 GFLOPS | >0.2 GFLOPS | >5 GFLOPS |
Stretch Goals (v0.2)
- Match or exceed PyTorch performance on CPU
- Reach 50%+ of theoretical peak on GPU
- Sub-millisecond inference for small models
Publishing Results
All benchmark results are published in the repository.
benchmarks/results/
results/
├── cpu-scalar/
│   ├── matmul.json
│   ├── relu.json
│   └── ...
├── cpu-avx2/
│   └── ...
├── cuda-rtx4090/
│   └── ...
└── comparisons/
    ├── vs-pytorch.md
    └── vs-numpy.md
Example: benchmarks/results/cpu-scalar/matmul.json
{
"operation": "matmul",
"backend": "cpu-scalar",
"hardware": "Intel i9-13980HX",
"date": "2024-11-06",
"results": [
{
"size": [1024, 1024, 1024],
"iterations": 100,
"mean_ns": 12500000,
"gflops": 0.17
}
]
}
Tools
# Run all benchmarks
zig build bench
# Run specific operation
zig build bench-matmul
zig build bench-activations
# Compare backends
zig build bench-compare-backends
# Generate report
zig build bench-report --output report.md
# Profile (Linux only)
zig build bench-matmul --profile
# Generates flamegraph.svg
Tiger Style for Ztorch
Ztorch follows Tiger Style development philosophy, adapted from TigerBeetle.
Core Principles
1. Safety First
No Undefined Behavior
// ✅ Good: Explicit size
const n: u32 = 1024;
// ❌ Bad: Architecture-dependent
const n: usize = 1024;
Fixed Limits
// ✅ Good: Bounded
const MAX_DIMS = 8;
for (0..@min(ndim, MAX_DIMS)) |i| { ... }
// ❌ Bad: Unbounded
while (has_more_dims()) { ... }
Static Allocation
// ✅ Good: Allocate at init
pub fn init(allocator: Allocator) !Model {
const weights = try allocator.alloc(f32, WEIGHT_SIZE);
return Model{ .weights = weights };
}
// ❌ Bad: Dynamic allocation in hot path
pub fn forward(self: *Model) !Tensor {
const temp = try self.allocator.alloc(f32, runtime_size); // ❌
// ...
}
Explicit Error Handling
// ✅ Good: Handle all errors
const result = matmul(a, b) catch |err| {
log.err("MatMul failed: {}", .{err});
return err;
};
// ❌ Bad: Ignore errors
const result = matmul(a, b) catch unreachable; // ❌
2. Performance
Napkin Math First
// Before implementing, calculate:
// MatMul (1024, 1024, 1024):
// FLOPs: 2 * 1024^3 = 2.1 GFLOPs
// Memory: 12 MB
// Expected time: ~50µs on RTX 4090
//
// Then implement and measure actual vs expected.
Batch Operations
// ✅ Good: Process multiple items
pub fn forward_batch(inputs: []Tensor) ![]Tensor {
// Single kernel launch for all inputs
}
// ❌ Bad: One at a time
for (inputs) |input| {
_ = try forward(input); // Multiple kernel launches
}
Optimize Resources in Order
- Network (if distributed)
- Disk (if I/O bound)
- Memory (usually the bottleneck for ML)
- CPU/GPU (optimize last)
3. Developer Experience
Clear Naming
// ✅ Good: Descriptive
pub fn matmul_cpu_scalar(a: Tensor, b: Tensor) Tensor { ... }
const latency_ms_max: f32 = 100.0;
// ❌ Bad: Abbreviated
pub fn mm(a: T, b: T) T { ... }
const max_lat: f32 = 100.0;
Document the Why
// ✅ Good: Explains reason
// We subtract max before exp() to prevent overflow.
// For input [1000, 1001], exp() would overflow, but
// exp([0, 1]) / sum(exp([0, 1])) gives same result.
const max_val = max(input);
for (input) |val| {
exp(val - max_val);
}
// ❌ Bad: Just describes what
// Subtract max
const max_val = max(input);
Organize Logically
// ✅ Good: Grouped by domain
src/
├── ops/        # Operations
│   ├── matmul.zig
│   ├── relu.zig
│   └── softmax.zig
├── backends/   # Backend implementations
│   ├── cpu.zig
│   └── cuda.zig
└── ir/         # Internal representation
// ❌ Bad: Mixed concerns
src/
├── stuff.zig
├── more_stuff.zig
└── utils.zig
Specific Guidelines for Ztorch
Tensor Operations
Always Check Shapes
pub fn matmul(a: Tensor, b: Tensor) !Tensor {
// Tiger Style: Validate inputs
if (a.shape.dims[a.shape.ndim - 1] != b.shape.dims[0]) {
return error.ShapeMismatch;
}
// ...
}
Use Comptime When Possible
// ✅ Good: Shapes known at compile time
const Model = ztorch.Sequential(.{
ztorch.Linear(784, 128), // comptime validation
ztorch.ReLU(),
ztorch.Linear(128, 10), // 128 matches!
});
// ❌ Bad: Shapes only checked at runtime
var model = Model.init();
model.addLayer(Linear.init(784, 128));
model.addLayer(ReLU.init());
model.addLayer(Linear.init(64, 10)); // Oops, 64 != 128!
Backend Implementation
Reference Implementation First
// Step 1: CPU scalar (obviously correct)
pub fn relu_cpu_scalar(input: []f32, output: []f32) void {
for (input, output) |in, *out| {
out.* = @max(0, in);
}
}
// Step 2: Test thoroughly
test "relu: cpu scalar" { ... }
// Step 3: Optimize (SIMD)
pub fn relu_cpu_simd(input: []f32, output: []f32) void {
// ...
}
// Step 4: Verify against reference
test "relu: cpu simd vs scalar" {
try expectEqualSlices(scalar_result, simd_result);
}
Prove GPU Optimizations
# Before optimization
MatMul 1024x1024: 100µs (21 GFLOPS)
# After tiling optimization
MatMul 1024x1024: 25µs (84 GFLOPS) # 4x speedup ✅
# Document in code:
// Tiled implementation achieves 84 GFLOPS vs 21 GFLOPS naive (4x).
// Uses 32x32 tiles to maximize shared memory reuse.
Error Handling
Fail Fast on Programmer Errors
pub fn forward(self: *Model, input: Tensor) !Tensor {
// Programmer error: wrong input shape
if (input.shape.dims[1] != self.input_size) {
// Tiger Style: This is a bug, not a runtime error
std.debug.panic(
"Input shape mismatch: expected {}, got {}",
.{self.input_size, input.shape.dims[1]}
);
}
// Runtime error: GPU out of memory
const output = self.backend.alloc(output_size) catch |err| {
// This can happen, return error
return err;
};
return output;
}
Memory Management
Explicit Lifetimes
// ✅ Good: Clear ownership
pub fn forward(self: *Model, input: Tensor) !Tensor {
const output = try Tensor.zeros(.{32, 10}, .f32, .cpu);
// Caller owns output, caller must free
return output;
}
pub fn example() !void {
const output = try model.forward(input);
defer output.deinit(); // Explicit cleanup
}
// ❌ Bad: Unclear ownership
pub fn forward(self: *Model, input: Tensor) !Tensor {
// Who frees this? Model? Caller?
const output = try self.allocator.create(Tensor);
// ...
}
Anti-Patterns to Avoid
Magic Numbers
// ❌ Bad
if (size > 1024) { ... }
// ✅ Good
const MAX_BATCH_SIZE = 1024;
if (size > MAX_BATCH_SIZE) { ... }
Premature Abstraction
// ❌ Bad: Over-engineered
pub const BackendFactory = struct {
pub fn create(comptime T: type) Backend(T) { ... }
};
// ✅ Good: Simple
pub fn createCpuBackend() Backend { ... }
pub fn createCudaBackend() Backend { ... }
Hidden Allocations
// ❌ Bad: Surprise allocation
pub fn concat(a: Tensor, b: Tensor) Tensor {
const result = Tensor.alloc(...); // Hidden!
// ...
}
// ✅ Good: Explicit
pub fn concat(allocator: Allocator, a: Tensor, b: Tensor) !Tensor {
const result = try Tensor.alloc(allocator, ...);
// ...
}
Ignoring Errors
// ❌ Bad
const result = riskyOperation() catch unreachable;
// ✅ Good
const result = riskyOperation() catch |err| {
log.err("Operation failed: {}", .{err});
return err;
};
Checklist for PRs
Before submitting, verify:
- All tests pass on all platforms
- Code formatted with zig fmt
- No undefined behavior (checked with assertions)
- Fixed resource limits where applicable
- Error handling is explicit
- Napkin math documented for performance code
- Benchmarks prove optimizations
- Public APIs documented
- Complex logic has comments explaining "why"
- No magic numbers
- Memory ownership is clear
Resources
Questions?
If you're unsure whether something follows Tiger Style, ask in a PR or issue. We're here to help!
Ztorch Zig 0.15.x Quick Reference: DOs and DON'Ts
Language Syntax
❌ DON'T use usingnamespace
// WRONG - Removed in 0.15.x
pub usingnamespace @import("other.zig");
✅ DO use explicit declarations or conditionals
// RIGHT
pub const foo = other.foo;
pub const bar = other.bar;
// OR for conditional inclusion:
pub const init = if (condition) initWindows else initLinux;
I/O and Printing
❌ DON'T use old generic writer API
// WRONG
const stdout = std.io.getStdOut().writer();
try stdout.print("...", .{});
✅ DO provide explicit buffer to writer
// RIGHT
var stdout_buffer: [4096]u8 = undefined;
var stdout_writer = std.fs.File.stdout().writer(&stdout_buffer);
const stdout = &stdout_writer.interface;
try stdout.print("...", .{});
try stdout.flush(); // Always flush!
Format Strings
❌ DON'T use {} for custom types with format methods
// WRONG
std.debug.print("{}", .{my_custom_type});
✅ DO use {f} to call format methods explicitly
// RIGHT
std.debug.print("{f}", .{my_custom_type});
❌ DON'T pass format string to format() method
// WRONG
pub fn format(
self: @This(),
comptime fmt: []const u8,
options: std.fmt.FormatOptions,
writer: anytype,
) !void { ... }
✅ DO use new simplified format signature
// RIGHT
pub fn format(self: @This(), writer: *std.Io.Writer) std.Io.Writer.Error!void {
try writer.print("value: {d}", .{self.value});
}
Casts
❌ DON'T pass the target type to the cast builtin
// WRONG
const counter: u64 = @intCast(u64, input);
const seconds: f64 = @floatFromInt(f64, nanoseconds);
✅ DO use @as with the dedicated cast routines
// RIGHT
const counter = @as(u64, @intCast(input));
const seconds = @as(f64, @floatFromInt(nanoseconds));
These casts are now single-argument functions in Zig 0.15.x; wrap the result with @as to bind the target type.
✅ DO remember that the result type comes from context
The cast builtins (@intCast, @floatCast, @floatFromInt, …) infer their result type from the result location, not from an argument. When no result type is available from context, wrap the call in @as (or assign to an explicitly typed variable) to name the destination type.
const src_f16: f16 = 1.5;
const dst_f32 = @as(f32, @floatCast(src_f16)); // ✅ explicit destination
const src_i16: i16 = 42;
var dst_i32: i32 = undefined;
dst_i32 = @as(i32, @intCast(src_i16)); // ✅ widening cast
Two-argument calls such as @floatCast(f32, src) or @intCast(i32, src) will no longer compile.
ArrayList
❌ DON'T expect allocator field in ArrayList
// WRONG - ArrayList doesn't store allocator anymore
var list = std.ArrayList(u8).init(allocator);
try list.append(item); // No allocator stored!
✅ DO pass allocator to each operation
// RIGHT - Unmanaged by default
var list = std.ArrayList(u8){};
try list.append(allocator, item);
try list.appendSlice(allocator, items);
list.deinit(allocator);
// OR use Managed if you really want allocator field
var list = std.array_list.Managed(u8).init(allocator);
try list.append(item);
File Operations
❌ DON'T use deprecated reader()/writer()
// WRONG
const file = try std.fs.cwd().openFile("file.txt", .{});
const reader = file.reader(); // Deprecated!
✅ DO use new buffer-aware API
// RIGHT
const file = try std.fs.cwd().openFile("file.txt", .{});
var buffer: [4096]u8 = undefined;
var file_reader = file.reader(&buffer);
const reader = &file_reader.interface;
Inline Assembly Clobbers
❌ DON'T use string clobbers
// WRONG
asm volatile ("syscall"
: [ret] "={rax}" (-> usize),
: [num] "{rax}" (number),
: "rcx", "r11" // String clobbers!
);
✅ DO use typed clobbers
// RIGHT
asm volatile ("syscall"
: [ret] "={rax}" (-> usize),
: [num] "{rax}" (number),
: .{ .rcx = true, .r11 = true } // Typed!
);
Run zig fmt to auto-upgrade this!
Compression (flate/zlib/gzip)
❌ DON'T use old compress API
// WRONG
var decompress = try std.compress.zlib.decompressor(reader);
✅ DO use new unified flate API
// RIGHT
var decompress_buffer: [std.compress.flate.max_window_len]u8 = undefined;
var decompress: std.compress.flate.Decompress = .init(reader, .zlib, &decompress_buffer);
const decompress_reader = &decompress.reader;
Data Structures
❌ DON'T use BoundedArray
// WRONG - Removed
var stack = try std.BoundedArray(i32, 8).fromSlice(items);
✅ DO use ArrayList with stack buffer
// RIGHT
var buffer: [8]i32 = undefined;
var stack = std.ArrayListUnmanaged(i32).initBuffer(&buffer);
try stack.appendSliceBounded(items);
❌ DON'T use generic LinkedList
// WRONG
var list = std.DoublyLinkedList(MyType).init();
✅ DO use intrusive list with @fieldParentPtr
// RIGHT
const MyType = struct {
node: std.DoublyLinkedList.Node = .{},
data: i32,
};
var list: std.DoublyLinkedList = .{};
// Use @fieldParentPtr("node", node_ptr) to get back to MyType
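A minimal sketch of walking such a list and recovering the containing struct (it continues the MyType example above and assumes nodes were already appended):
var it = list.first;
while (it) |node| : (it = node.next) {
    // Recover the enclosing MyType from its embedded node field.
    const item: *MyType = @fieldParentPtr("node", node);
    std.debug.print("data = {d}\n", .{item.data});
}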
Arithmetic on undefined
❌ DON'T do arithmetic on undefined
// WRONG - Compile error in 0.15.x
const a: u32 = 0;
const b: u32 = undefined;
const c = a + b; // ERROR: use of undefined causes illegal behavior
✅ DO avoid operations on undefined
// RIGHT - Don't use undefined in arithmetic
const a: u32 = 0;
const b: u32 = 0; // Use actual value or @import("std").mem.zeroes(u32)
const c = a + b;
Pointers and Casting
✅ DO use @ptrCast for single-item to slice
// NEW in 0.15.x - This now works!
const val: u32 = 1;
const bytes: []const u8 = @ptrCast(&val);
// Returns slice of same byte size as operand
✅ DO use the new numeric casts
// ints -> floats
const items: usize = 42;
const weight: f32 = @as(f32, @floatFromInt(items));
// float width changes
const precise: f64 = 3.1415926535;
const pi_approx: f32 = @floatCast(precise);
// ints -> narrower ints (panics on overflow in Debug)
const index: usize = 128;
const idx_i32: i32 = @intCast(index);
❌ DON'T rely on implicit coercion
// WRONG: no more implicit float narrowing
const precise: f64 = 3.14;
const pi: f32 = precise; // Compile error
✅ DO use @as for explicit coercion
// RIGHT: @as documents intent and stays future proof
const precise: f64 = 3.14;
const pi: f32 = @floatCast(precise);
const total: usize = @as(usize, @intCast(@max(10, 5)));
Switch on Non-Exhaustive Enums
✅ DO mix explicit tags with _ prong
// NEW in 0.15.x - This is now allowed
switch (enum_val) {
.special_case_1 => foo(),
.special_case_2 => bar(),
_, .special_case_3 => baz(), // Mix unnamed (_) with named
}
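For context, a non-exhaustive enum is declared with an explicit tag type and a trailing _ field. A minimal sketch of the kind of enum the switch above assumes (tag names are illustrative):
const Event = enum(u8) {
    special_case_1,
    special_case_2,
    special_case_3,
    _, // non-exhaustive: other tag values are allowed
};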
Build System
❌ DON'T use implicit root module fields
// WRONG - Removed in 0.15.x
const exe = b.addExecutable(.{
.name = "zros",
.root_source_file = b.path("src/main.zig"), // WRONG!
.target = target,
.optimize = optimize,
});
✅ DO use explicit root_module
// RIGHT
const exe = b.addExecutable(.{
.name = "zros",
.root_module = b.createModule(.{
.root_source_file = b.path("src/main.zig"),
.target = target,
.optimize = optimize,
}),
});
Quick Migration Checklist
- ✅ Remove all usingnamespace (use conditionals or explicit imports)
- ✅ Update all I/O code to provide explicit buffers
- ✅ Change {} to {f} for custom format methods
- ✅ Remove format strings from format() method signatures
- ✅ Update ArrayList usage (pass allocator to operations)
- ✅ Run zig fmt to auto-fix inline assembly clobbers
- ✅ Update build.zig to use root_module
- ✅ Replace BoundedArray with ArrayList + stack buffer
- ✅ Update compress/flate API usage
- ✅ Always flush() writers explicitly
For OS Development (Freestanding)
✅ These work in freestanding mode:
std.mem.* // Memory operations
std.fmt.* // Formatting (with your writer)
std.debug.* // Assertions
std.math.* // Math
std.ArrayList // Data structures
std.HashMap // Data structures
@memcpy, @memset // Compiler builtins
❌ These DON'T work in freestanding:
std.fs.* // No filesystem yet (you're building it!)
std.net.* // No network yet (you're building it!)
std.os.* // You ARE the OS
std.io.* // Use std.Io.Reader/Writer with your own backend
Kernel-Specific Tips
✅ DO use explicit buffer management
// Perfect for kernel - no hidden allocations!
var buffer: [1024]u8 = undefined;
var writer = myDevice.writer(&buffer);
const w = &writer.interface;
try w.print("Kernel message\n", .{});
try w.flush();
✅ DO use ArrayListUnmanaged for kernel data structures
// No hidden allocator field - perfect for kernel
var process_list = std.ArrayListUnmanaged(*Process){};
try process_list.append(my_allocator, new_process);
✅ DO use @memcpy and @memset (compiler builtins)
// Optimized by LLVM for your target
@memcpy(dest, src);
@memset(buffer, 0);
Auto-Fix Tools
Run these to auto-migrate:
zig fmt # Fixes inline assembly clobbers
# Manual fixes needed for everything else (sorry!)
When in Doubt
- Check compiler errors - they're usually clear in 0.15.x
- Look at lib/std/ source code for examples
- The new APIs are often simpler than old ones
- Explicit is better than implicit (Zig philosophy)
Remember: Most breaking changes make kernel development easier (explicit buffers, simpler APIs, faster compilation).
Contributing to Ztorch
Thank you for your interest in contributing to Ztorch!
Code of Conduct
Be respectful, constructive, and professional. We're building something useful together.
Development Philosophy
Ztorch follows Tiger Style development:
- Safety first
- Benchmarked performance
- Clean developer experience
- Zero technical debt
See tiger-style.md for full details.
Getting Started
Prerequisites
- Zig 0.15.x (latest)
- Git
- (Optional) CUDA Toolkit 12.0+ for GPU development
- (Optional) ROCm 5.0+ for AMD GPU development
Setup
# Clone
git clone https://github.com/mattneel/ztorch.git
cd ztorch
# Build
zig build
# Run tests
zig build test
# Run benchmarks
zig build bench
Contribution Workflow
1. Find or Create an Issue
- Check existing issues first
- For new features, discuss in an issue before implementing
- For bugs, provide minimal reproduction
2. Fork and Branch
git checkout -b feature/add-conv2d
# or
git checkout -b fix/matmul-gradient-bug
3. Implement
TDD Workflow:
- Write test first (red)
- Implement minimal code (green)
- Refactor
- Benchmark if performance-sensitive
Example: Adding a New Operation
// 1. Write test (test/ops/new_op_test.zig)
test "new_op: known result" {
    // Template placeholder: the example op squares each element.
    const input = [_]f32{ 1, 2, 3 };
    var output: [3]f32 = undefined;
    ztorch.ops.new_op_cpu(&input, &output);
    try testing.expectEqual(@as(f32, 1.0), output[0]);
}
// 2. Implement (src/ops/cpu/new_op.zig)
pub fn new_op_cpu(input: []const f32, output: []f32) void {
    std.debug.assert(input.len == output.len);
    for (input, output) |in, *out| {
        out.* = in * in; // Replace with the real operation.
    }
}
// 3. Add gradient test
test "new_op: gradient check" {
// ... numerical gradient verification
}
// 4. Benchmark
bench "new_op: 1M elements" {
// ... measure performance
}
// 5. Add to public API (src/ztorch.zig)
pub const new_op = ops.new_op_cpu;
4. Test
# Run all tests
zig build test
# Run specific test
zig build test -- test/ops/new_op_test.zig
# Check formatting
zig fmt --check .
# Fix formatting
zig fmt .
5. Benchmark
If your change affects performance:
# Run benchmarks
zig build bench
# Compare against main
git checkout main
zig build bench --save-baseline main.json
git checkout your-branch
zig build bench --compare main.json
Include benchmark results in your PR description.
6. Document
- Add docstrings to public functions
- Update README if adding features
- Update relevant docs/ files
- Add examples/ if appropriate
7. Commit
git add .
git commit -m "feat: add conv2d operation
- Implement CPU scalar version
- Add unit tests and gradient checks
- Benchmark: 2.3 GFLOPS on 32x32 kernels
- Refs #42"
Commit Message Format:
<type>: <subject>
<body>
<footer>
Types:
- feat: New feature
- fix: Bug fix
- perf: Performance improvement
- test: Adding tests
- docs: Documentation only
- refactor: Code change that neither fixes a bug nor adds a feature
- chore: Maintenance tasks
8. Push and PR
git push origin feature/add-conv2d
Create PR with:
- Clear description of what and why
- Link to related issue
- Test results
- Benchmark results (if applicable)
- Breaking changes (if any)
PR Review Process
- Automated Checks: CI must pass (all platforms)
- Code Review: Maintainer reviews code
- Discussion: Address feedback
- Approval: Maintainer approves
- Merge: Maintainer merges
Review Timeline:
- Initial response: 1-3 days
- Complete review: 1-2 weeks (depending on complexity)
What We're Looking For
High Priority
- Core operations (MatMul, Conv, Attention, etc.)
- Backend implementations (CUDA, ROCm, Vulkan)
- Performance optimizations (with benchmarks!)
- Bug fixes
- Test coverage improvements
- Documentation improvements
Medium Priority
- Additional frontends (PyTorch, TensorFlow import)
- Serialization/deserialization
- Quantization support
- More activation functions
Low Priority
- Minor refactors
- Code style changes
- Non-critical features
Guidelines
Code Style
Follow Tiger Style principles:
Do:
// Clear names
pub fn matmul_cpu_scalar(a: Tensor, b: Tensor) Tensor { ... }
// Explicit sizes
const n: u32 = 1024;
// Fixed limits
const MAX_BATCH_SIZE: usize = 1024;
for (0..@min(batch_size, MAX_BATCH_SIZE)) |i| { ... }
// Simple control flow
if (condition) {
simple_case();
} else {
other_case();
}
Don't:
// Vague names
pub fn mm(a: T, b: T) T { ... }
// Architecture-dependent sizes
const n: usize = 1024;
// Unbounded loops
while (has_more()) { ... }
// Complex nested logic
if (a) {
if (b) {
if (c) { ... }
}
}
Testing
Every PR must include:
- Unit tests for new functionality
- Gradient checks for differentiable ops (see the sketch after this list)
- Backend parity tests for GPU implementations
- Benchmarks for performance-sensitive code
Test Coverage:
- Minimum 90% line coverage
- All error paths tested
- Edge cases covered
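A minimal sketch of a numerical gradient check, using a hypothetical scalar relu and its analytic gradient against a central-difference estimate (the test points and tolerances are illustrative):
const std = @import("std");

fn relu(x: f32) f32 {
    return if (x > 0) x else 0;
}

fn reluGrad(x: f32) f32 {
    return if (x > 0) 1 else 0;
}

test "relu: gradient matches central difference" {
    const eps: f32 = 1e-3;
    const xs = [_]f32{ -2.0, -0.5, 0.5, 2.0 }; // stay away from the kink at 0
    for (xs) |x| {
        // Central difference: (f(x + eps) - f(x - eps)) / (2 * eps)
        const numeric = (relu(x + eps) - relu(x - eps)) / (2 * eps);
        try std.testing.expectApproxEqAbs(reluGrad(x), numeric, 1e-2);
    }
}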
Documentation
/// Matrix multiplication: C = A @ B
///
/// Computes the matrix product of two 2D tensors.
///
/// # Arguments
/// * `a` - Left matrix of shape (M, K)
/// * `b` - Right matrix of shape (K, N)
///
/// # Returns
/// Result matrix of shape (M, N)
///
/// # Example
/// ```
/// const a = try Tensor.randn(.{32, 64});
/// const b = try Tensor.randn(.{64, 128});
/// const c = try matmul(a, b);
/// ```
pub fn matmul(a: Tensor, b: Tensor) !Tensor {
// ...
}
Performance
- Always napkin math first (an example sketch follows this list)
- Benchmark before and after
- Document expected vs actual performance
- Prove optimizations with numbers
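As an illustration, the kind of napkin math we expect next to a hot loop, attached here to a naive matmul (the throughput numbers are assumed, not measured):
// Napkin math (illustrative): a 1024x1024x1024 matmul performs
// 2 * 1024^3 ≈ 2.15e9 FLOPs. At an assumed ~20 GFLOPS of scalar throughput
// that is roughly 107 ms per call, so a claimed 4x tiling win should land
// near 27 ms. The benchmark must confirm the estimate.
pub fn matmulNaive(a: []const f32, b: []const f32, c: []f32, n: usize) void {
    for (0..n) |i| {
        for (0..n) |j| {
            var acc: f32 = 0;
            for (0..n) |k| acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
    }
}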
Safety
- No undefined behavior
- Explicit error handling
- Fixed resource limits
- Assertions for invariants (see the sketch below)
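For example, a minimal sketch of asserting invariants at a function boundary (the function and its invariants are illustrative):
const std = @import("std");

pub fn scaleInPlace(values: []f32, factor: f32) void {
    // Invariants: callers never pass an empty slice or a non-finite factor.
    std.debug.assert(values.len > 0);
    std.debug.assert(std.math.isFinite(factor));
    for (values) |*v| v.* *= factor;
}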
Common Issues
CI Failures
"Test failed on Windows"
- Windows has different line endings
- Use \n consistently
- Run tests locally on Windows if possible
"Formatting check failed"
zig fmt .
git add .
git commit --amend --no-edit
git push --force
"Benchmark regression"
- Investigate why performance decreased
- Either fix the regression or justify it
- Update baseline if intentional
Review Feedback
"This needs tests"
- Add missing test cases
- Ensure coverage is adequate
"Can you add a benchmark?"
- Add benchmark for the new code
- Compare against baseline
"This violates Tiger Style"
- Review tiger-style.md
- Refactor accordingly
Getting Help
- Questions: Open a discussion
- Bugs: Open an issue
- Features: Open an issue for discussion first
- Chat: (Discord/Matrix link if available)
Recognition
Contributors are recognized in:
- CONTRIBUTORS.md file
- Release notes
- Git history
Significant contributions may earn you commit access.
License
By contributing, you agree that your contributions will be licensed under the same license as the project (Apache 2.0 or MIT).