Ztorch
Pure Zig machine learning library with comptime optimization and multiple backends
Ztorch is a high-performance ML library built from the ground up in Zig, featuring:
- Comptime graph optimization - Fuse operations and generate specialized kernels at compile time
- Multiple backends - CPU (scalar & SIMD), CUDA, ROCm, Vulkan
- Tiger Style development - Safety first, benchmarked performance, zero technical debt
- Test-driven - TDD from day 0, tested on Linux/macOS/Windows, x86_64/aarch64
- Clean API - Define models as Zig structs, explicit and ergonomic
Status
v0.1-dev - Early development. Core architecture and CPU backend in progress.
- ✅ Project architecture defined
- ✅ Tiger Style development process established
- 🚧 CPU scalar backend (in progress)
- ⏳ CUDA backend (planned)
- ⏳ ROCm backend (planned)
- ⏳ Vulkan backend (planned)
Quick Start
const std = @import("std");
const ztorch = @import("ztorch");
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer _ = gpa.deinit();
// Define model at comptime
const Model = ztorch.Sequential(.{
ztorch.Linear(784, 128),
ztorch.ReLU(),
ztorch.Linear(128, 10),
ztorch.Softmax(),
});
// Compile for backend
var model = try Model.compile(.cpu, gpa.allocator());
defer model.deinit();
// Train
const input = try ztorch.Tensor.randn(.{32, 784}, .f32, .cpu);
const labels = try ztorch.Tensor.randint(.{32}, 10, .cpu);
const output = try model.forward(input);
const loss = try ztorch.crossEntropy(output, labels);
try model.backward(loss);
try model.step(.{ .adam = .{ .lr = 0.001 } });
std.debug.print("Loss: {d:.4}\n", .{loss.item()});
}
Installation
Add to your build.zig.zon:
.dependencies = .{
.ztorch = .{
.url = "https://github.com/mattneel/ztorch/archive/refs/tags/v0.1.0.tar.gz",
.hash = "...",
},
},
In your build.zig:
const ztorch = b.dependency("ztorch", .{
.target = target,
.optimize = optimize,
});
exe.root_module.addImport("ztorch", ztorch.module("ztorch"));
Why Ztorch?
Comptime optimization: Your model structure is known at compile time, so Ztorch can fuse operations, eliminate overhead, and generate optimal kernels for your exact architecture.
Multiple backends: Write once, run on CPU, CUDA, ROCm, or Vulkan. Each backend is optimized for its target.
Proven correct: Every operation is tested against a reference implementation. GPU backends are verified against CPU. No surprises.
Benchmarked: Every optimization is measured. We prove the gains with napkin math and benchmarks.
No magic: Explicit control flow, static allocation, clear error handling. You know exactly what the machine is doing.
Building
# Run tests
zig build test
# Run benchmarks
zig build bench
# Build examples
zig build examples
# Check formatting
zig fmt --check .
Design Principles (Tiger Style)
- Safety - Fixed limits, static allocation, explicit errors, fail fast
- Performance - Napkin math first, benchmark everything, prove all gains
- Developer Experience - Clean API, clear names, good documentation
- Zero Technical Debt - TDD from day 0, do it right the first time
Ztorch Architecture
This document describes the internal architecture of Ztorch.
Overview
Ztorch is a compiler-based ML library. Models are defined at compile time (or runtime), converted to an internal representation (IR), optimized, and then compiled to backend-specific code.
Model Definition → IR → Optimization → Autograd → Backend Codegen → Execution
Components
1. IR (Internal Representation)
The IR is Ztorch's internal graph representation. It's the single source of truth for all transformations.
pub const Graph = struct {
nodes: []Node,
edges: []Edge,
allocator: Allocator,
};
pub const Node = union(enum) {
matmul: MatMulOp,
relu: ActivationOp,
softmax: SoftmaxOp,
layernorm: LayerNormOp,
// ... more ops
};
pub const Edge = struct {
from: NodeId,
to: NodeId,
tensor_shape: Shape,
};
Design principles:
- Immutable after construction (transformations create new graphs)
- Validates shape compatibility at creation
- Lightweight - can be copied/cloned cheaply
2. Frontends
Frontends convert external model formats to Ztorch IR.
Native Zig API (v0.1):
const Model = ztorch.Sequential(.{
ztorch.Linear(784, 128),
ztorch.ReLU(),
});
This comptime struct is converted to IR during compilation.
ONNX Import (v0.2+):
const graph = try ztorch.frontends.onnx.load("model.onnx");
3. Optimization Passes
Optimization passes transform the IR to improve performance.
v0.1 Optimizations:
- Operator fusion (e.g., MatMul + ReLU → FusedMatMulReLU)
- Constant folding
- Dead code elimination
- Memory layout optimization
Example:
Before: MatMul → ReLU → Softmax (3 kernel launches)
After: FusedMatMulReLUSoftmax (1 kernel launch)
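As a rough sketch of what such a pass looks like, the walk below rewrites adjacent MatMul/ReLU pairs in a linear op chain. The OpKind tag and the chain representation are simplified stand-ins for the real Graph/Node types, not the Ztorch implementation.
const std = @import("std");

// Simplified stand-ins for the IR node kinds described above.
const OpKind = enum { matmul, relu, softmax, fused_matmul_relu };

/// Fuse every MatMul immediately followed by a ReLU into one fused op.
/// Works on a linear chain for brevity; a real pass would walk edges
/// and build a fresh graph to keep the IR immutable.
fn fuseMatMulRelu(allocator: std.mem.Allocator, ops: []const OpKind) ![]OpKind {
    var out = try allocator.alloc(OpKind, ops.len); // output is never longer than input
    var n: usize = 0;
    var i: usize = 0;
    while (i < ops.len) : (i += 1) {
        if (ops[i] == .matmul and i + 1 < ops.len and ops[i + 1] == .relu) {
            out[n] = .fused_matmul_relu;
            i += 1; // absorb the ReLU we just fused
        } else {
            out[n] = ops[i];
        }
        n += 1;
    }
    return allocator.realloc(out, n);
}

test "fuse matmul+relu chain" {
    const chain = [_]OpKind{ .matmul, .relu, .softmax };
    const fused = try fuseMatMulRelu(std.testing.allocator, &chain);
    defer std.testing.allocator.free(fused);
    try std.testing.expectEqualSlices(OpKind, &[_]OpKind{ .fused_matmul_relu, .softmax }, fused);
}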
4. Autograd
The autograd system generates backward pass operations from forward pass IR.
Each operation has a gradient function:
pub const MatMulOp = struct {
pub fn forward(a: Tensor, b: Tensor) Tensor { ... }
pub fn backward(
d_output: Tensor,
a: Tensor,
b: Tensor,
) struct { d_a: Tensor, d_b: Tensor } {
// d_a = d_output @ b.T
// d_b = a.T @ d_output
...
}
};
The autograd pass walks the forward graph and generates the backward graph.
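A minimal sketch of that reverse walk over a simplified op list. The FwdOp/BwdOp tags and the gradOf table are illustrative stand-ins; the real pass emits full Node structs and wires up the edges between them.
const std = @import("std");

// Simplified forward-op tags and their matching gradient ops (hypothetical names).
const FwdOp = enum { matmul, relu, softmax };
const BwdOp = enum { matmul_backward, relu_backward, softmax_backward };

fn gradOf(op: FwdOp) BwdOp {
    return switch (op) {
        .matmul => .matmul_backward,
        .relu => .relu_backward,
        .softmax => .softmax_backward,
    };
}

/// Emit backward ops by visiting the forward chain in reverse order.
fn buildBackward(allocator: std.mem.Allocator, forward: []const FwdOp) ![]BwdOp {
    const backward = try allocator.alloc(BwdOp, forward.len);
    for (backward, 0..) |*b, i| {
        b.* = gradOf(forward[forward.len - 1 - i]);
    }
    return backward;
}

test "backward ops come out reversed" {
    const fwd = [_]FwdOp{ .matmul, .relu, .softmax };
    const bwd = try buildBackward(std.testing.allocator, &fwd);
    defer std.testing.allocator.free(bwd);
    try std.testing.expectEqualSlices(
        BwdOp,
        &[_]BwdOp{ .softmax_backward, .relu_backward, .matmul_backward },
        bwd,
    );
}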
5. Backend Codegen
Backend codegen converts IR operations to executable code.
CPU Scalar (reference):
- Direct Zig implementation
- Simple, obviously correct
- Used for verification
CPU SIMD:
- Intrinsics for AVX2/AVX512 (x86)
- Intrinsics for NEON (ARM)
- Falls back to scalar if unsupported
CUDA:
- Generates PTX assembly
- Comptime specialization for shapes
- Tensor core utilization
ROCm:
- Generates LLVM IR
- Similar to CUDA approach
Vulkan:
- Generates SPIR-V
- Portable across vendors
6. Runtime
The runtime manages:
- Memory allocation (device buffers)
- Kernel launching
- Synchronization
- Error handling
Memory management:
- Static allocation during model compilation
- No dynamic allocation during forward/backward
- Explicit buffer reuse
Compilation Flow
Comptime Model Definition
const Model = ztorch.Sequential(.{
ztorch.Linear(784, 128),
ztorch.ReLU(),
ztorch.Linear(128, 10),
});
// At comptime:
// 1. Type-check layer compatibility (128 matches between layers)
// 2. Build IR graph
// 3. Apply optimization passes
// 4. Generate backward pass
Compilation
var model = try Model.compile(.cuda, allocator);
// During compile():
// 1. Finalize IR (if not comptime)
// 2. Allocate device memory
// 3. Generate backend code (PTX)
// 4. Load kernels
// 5. Create execution plan
Execution
const output = try model.forward(input);
// During forward():
// 1. Copy input to device (if needed)
// 2. Launch fused kernels in sequence
// 3. Return output tensor
Data Structures
Tensor
pub const Tensor = struct {
data: DevicePtr,
shape: Shape,
stride: Stride,
dtype: DType,
device: Device,
requires_grad: bool,
pub fn item(self: Tensor) f32 { ... }
pub fn reshape(self: Tensor, new_shape: Shape) Tensor { ... }
// ...
};
Shape
pub const Shape = struct {
dims: [MAX_DIMS]usize,
ndim: u8,
pub fn numel(self: Shape) usize {
var n: usize = 1;
for (self.dims[0..self.ndim]) |d| n *= d;
return n;
}
};
Backend Interface
All backends implement the same interface:
pub const Backend = struct {
vtable: *const VTable,
context: *anyopaque,
pub const VTable = struct {
matmul: *const fn (*anyopaque, Tensor, Tensor) Tensor,
relu: *const fn (*anyopaque, Tensor) Tensor,
softmax: *const fn (*anyopaque, Tensor, usize) Tensor,
// ... all ops
};
};
This allows runtime backend selection and testing backend parity.
Performance Model
Napkin Math
Before implementing any operation, estimate its cost:
MatMul (M, K) @ (K, N):
- FLOPs: 2 * M * K * N
- Memory: (M*K + K*N + M*N) * sizeof(f32) bytes
- Arithmetic intensity: 2*M*K*N / (M*K + K*N + M*N)
Example: (1024, 1024) @ (1024, 1024)
- FLOPs: 2.15B
- Memory: 12 MB
- On RTX 4090 (82 TFLOPS, 1 TB/s):
- Arithmetic intensity: ~175 FLOPs/byte > 82 FLOPs/byte, so compute bound ✓
- Compute time: 2.15 GFLOPs / 82 TFLOPS ≈ 26 µs
- Memory time: 12 MB / 1 TB/s = 12 µs, so expect ~26 µs plus launch overhead
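The same napkin math can be captured in a tiny helper. This is just the formulas above written in Zig; it is not part of the Ztorch API.
const std = @import("std");

/// FLOPs, bytes moved, and arithmetic intensity for C(M,N) = A(M,K) @ B(K,N), f32.
fn matmulNapkin(m: u64, k: u64, n: u64) struct { flops: u64, bytes: u64, intensity: f64 } {
    const flops = 2 * m * k * n;
    const bytes = (m * k + k * n + m * n) * @sizeOf(f32);
    return .{
        .flops = flops,
        .bytes = bytes,
        .intensity = @as(f64, @floatFromInt(flops)) / @as(f64, @floatFromInt(bytes)),
    };
}

test "1024^3 matmul is compute bound on an 82 FLOPs/byte machine" {
    const est = matmulNapkin(1024, 1024, 1024);
    try std.testing.expectEqual(@as(u64, 2_147_483_648), est.flops);
    try std.testing.expect(est.intensity > 82.0); // ~171 FLOPs/byte
}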
Benchmarking
Every implementation is benchmarked:
=== MatMul 1024x1024 ===
CPU Scalar: 450ms (4.8 GFLOPS)
CPU AVX2: 112ms (19.2 GFLOPS) - 4.0x speedup
CUDA (RTX 4090): 0.5ms (4300 GFLOPS) - 900x speedup
Testing Strategy
See testing.md for full details.
Levels:
- Unit tests (each op)
- Backend parity (GPU matches CPU)
- Gradient checks (numerical vs autograd)
- Integration (full model training)
Future Architecture
Dynamic Shapes (v0.2)
Support runtime shape variation within bounds:
const Model = ztorch.Sequential(.{
ztorch.Linear(784, 128),
// ... batch size determined at runtime
});
Distributed (zbmd integration)
Ztorch provides the compute engine, zbmd provides fault-tolerant distribution.
Quantization (v0.3)
Support int8, fp16, bfloat16 for inference acceleration.
Ztorch API Reference
Complete API documentation for Ztorch v0.1.
Core Types
Tensor
The fundamental data structure.
pub const Tensor = struct {
data: DevicePtr,
shape: Shape,
stride: Stride,
dtype: DType,
device: Device,
requires_grad: bool,
};
Creation
// Zeros
const t = try Tensor.zeros(.{32, 128}, .f32, .cpu);
// Ones
const t = try Tensor.ones(.{32, 128}, .f32, .cpu);
// Random normal distribution
const t = try Tensor.randn(.{32, 128}, .f32, .cpu);
// Random integers [0, high)
const t = try Tensor.randint(.{32}, 10, .cpu);
// From slice
const data = [_]f32{ 1, 2, 3, 4 };
const t = try Tensor.fromSlice(.{2, 2}, &data, .cpu);
Operations
// Reshape (view, no copy)
const reshaped = try t.reshape(.{64, 64});
// Transpose
const transposed = try t.transpose();
// Get single item (must be 1-element tensor)
const value: f32 = t.item();
// To slice (copies to CPU)
var buffer: [4]f32 = undefined;
try t.toSlice(&buffer);
Shape
pub const Shape = struct {
dims: [MAX_DIMS]usize,
ndim: u8,
};
// Create shape
const shape = Shape.init(&[_]usize{ 32, 128, 256 });
// Number of elements
const n = shape.numel(); // 32 * 128 * 256
Device
pub const Device = enum {
cpu,
cuda,
rocm,
vulkan,
};
DType
pub const DType = enum {
f32,
f64,
i32,
i64,
// more types in future versions
};
Model Definition
Sequential
Build a sequential model from layers.
const Model = ztorch.Sequential(.{
ztorch.Linear(784, 128),
ztorch.ReLU(),
ztorch.Linear(128, 64),
ztorch.ReLU(),
ztorch.Linear(64, 10),
});
Layers
Linear
Fully connected layer: y = x @ W.T + b
pub fn Linear(comptime in_features: usize, comptime out_features: usize) type
Example:
ztorch.Linear(784, 128) // 784 → 128
Activations
ReLU: y = max(0, x)
ztorch.ReLU()
GELU: Gaussian Error Linear Unit
ztorch.GELU()
Softmax: y[i] = exp(x[i]) / sum(exp(x))
ztorch.Softmax(.{.dim = -1}) // softmax over last dimension
Normalization
LayerNorm:
ztorch.LayerNorm(normalized_shape, .{
.eps = 1e-5,
.elementwise_affine = true,
})
Utilities
Data Loading
Helpers for turning binary dumps into tensors. Especially useful with the MNIST preparation script.
const cwd = std.fs.cwd();
var images = try ztorch.data.mnist.loadImages(cwd, "data/mnist_train_x.bin", allocator, .{
.max_samples = 1024, // 0 = entire file
});
defer images.deinit();
var labels = try ztorch.data.mnist.loadLabels(cwd, "data/mnist_train_y.bin", images.shape.dims[0], allocator, .{});
defer labels.deinit();
var iter = try ztorch.data.BatchIterator.init(&images, &labels, allocator, .{
.batch_size = 128,
.shuffle = true,
.seed = 42,
});
defer iter.deinit();
while (try iter.next()) |batch| {
var owned = batch;
defer owned.deinit();
// use owned.inputs / owned.labels.? tensors
}
Parameter Initializers
Create trainable tensors with the right gradient flags.
var weights = try ztorch.init.uniformParam(&[_]usize{ 784, 128 }, 0.05, allocator, 1);
var bias = try ztorch.init.zerosParam(&[_]usize{ 1, 128 }, allocator);
Metrics
Compute classification accuracy from logits or probabilities.
const train_acc = try ztorch.metrics.accuracyFromLogits(&logits, &labels, allocator);
var probs = try ztorch.ops.activations.softmax_cpu_scalar(&logits, -1, allocator);
defer probs.deinit();
const val_acc = try ztorch.metrics.accuracyFromProbabilities(&probs, &labels);
Checkpointing
Persist model parameters to disk and restore them later.
const entries = [_]ztorch.checkpoint.NamedTensor{
.{ .name = "w1", .tensor = &w1 },
.{ .name = "b1", .tensor = &b1 },
};
try ztorch.checkpoint.save("model.ckpt", entries[0..]);
var checkpoint = try ztorch.checkpoint.load("model.ckpt", allocator);
defer checkpoint.deinit();
const weights = checkpoint.get("w1") orelse unreachable;
MNIST Helpers
Utility functions for the built-in two-layer MNIST classifier.
const accuracy = try ztorch.models.mnist.evaluate(
&test_images,
&test_labels,
&w1,
&b1,
&w2,
&b2,
allocator,
);
Compilation
pub fn compile(
comptime self: anytype,
comptime backend: Device,
allocator: Allocator,
) !CompiledModel
Example:
const Model = ztorch.Sequential(.{
ztorch.Linear(784, 10),
});
var model = try Model.compile(.cpu, allocator);
defer model.deinit();
Forward Pass
pub fn forward(self: *CompiledModel, input: Tensor) !Tensor
Example:
const input = try Tensor.randn(.{32, 784}, .f32, .cpu);
const output = try model.forward(input);
Loss Functions
Cross Entropy
pub fn crossEntropy(predictions: Tensor, targets: Tensor) !Tensor
Example:
const output = try model.forward(input);
const loss = try ztorch.crossEntropy(output, labels);
Mean Squared Error
pub fn mse(predictions: Tensor, targets: Tensor) !Tensor
Backward Pass
pub fn backward(self: *CompiledModel, loss: Tensor) !void
Computes gradients for all parameters with requires_grad = true.
Example:
const loss = try ztorch.crossEntropy(output, labels);
try model.backward(loss);
Optimization
Optimizer Step
pub fn step(self: *CompiledModel, config: OptimizerConfig) !void
pub const OptimizerConfig = union(enum) {
sgd: SGDConfig,
adam: AdamConfig,
};
SGD
pub const SGDConfig = struct {
lr: f32,
momentum: f32 = 0.0,
weight_decay: f32 = 0.0,
};
try model.step(.{ .sgd = .{ .lr = 0.01, .momentum = 0.9 } });
Adam
pub const AdamConfig = struct {
lr: f32,
beta1: f32 = 0.9,
beta2: f32 = 0.999,
eps: f32 = 1e-8,
weight_decay: f32 = 0.0,
};
try model.step(.{ .adam = .{ .lr = 0.001 } });
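For reference, the per-parameter update that an Adam step applies, written as a minimal scalar sketch. Here m and v are the running first- and second-moment buffers the optimizer keeps per parameter; this is the textbook rule, not the actual Ztorch implementation.
const std = @import("std");

fn adamStep(
    params: []f32,
    grads: []const f32,
    m: []f32, // first-moment estimate, same length as params
    v: []f32, // second-moment estimate, same length as params
    t: u32, // step count, starting at 1
    lr: f32,
    beta1: f32,
    beta2: f32,
    eps: f32,
) void {
    const b1t = std.math.pow(f32, beta1, @as(f32, @floatFromInt(t)));
    const b2t = std.math.pow(f32, beta2, @as(f32, @floatFromInt(t)));
    for (params, grads, m, v) |*p, g, *mi, *vi| {
        mi.* = beta1 * mi.* + (1 - beta1) * g;
        vi.* = beta2 * vi.* + (1 - beta2) * g * g;
        const m_hat = mi.* / (1 - b1t); // bias correction
        const v_hat = vi.* / (1 - b2t);
        p.* -= lr * m_hat / (@sqrt(v_hat) + eps);
    }
}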
Complete Training Example
const std = @import("std");
const ztorch = @import("ztorch");
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer _ = gpa.deinit();
const allocator = gpa.allocator();
// Define model
const Model = ztorch.Sequential(.{
ztorch.Linear(784, 128),
ztorch.ReLU(),
ztorch.Linear(128, 64),
ztorch.ReLU(),
ztorch.Linear(64, 10),
});
// Compile
var model = try Model.compile(.cpu, allocator);
defer model.deinit();
// Training loop
const epochs = 10;
const batch_size = 32;
for (0..epochs) |epoch| {
var total_loss: f32 = 0;
var num_batches: usize = 0;
// ... load batch ...
const input = try ztorch.Tensor.randn(.{batch_size, 784}, .f32, .cpu);
const labels = try ztorch.Tensor.randint(.{batch_size}, 10, .cpu);
// Forward
const output = try model.forward(input);
// Loss
const loss = try ztorch.crossEntropy(output, labels);
total_loss += loss.item();
num_batches += 1;
// Backward
try model.backward(loss);
// Update
try model.step(.{ .adam = .{ .lr = 0.001 } });
std.debug.print("Epoch {}: Loss = {d:.4}\n", .{
epoch,
total_loss / @as(f32, @floatFromInt(num_batches)),
});
}
}
Backend-Specific Features
CUDA
// Use CUDA backend
var model = try Model.compile(.cuda, allocator);
// CUDA-specific device selection (future)
// ztorch.cuda.setDevice(0);
Memory Management
// Models manage their own memory
var model = try Model.compile(.cpu, allocator);
defer model.deinit(); // Frees all tensors
// Explicit tensor lifetime
const t = try Tensor.zeros(.{100}, .f32, .cpu);
defer t.deinit();
Error Handling
All fallible operations return errors:
pub const Error = error{
OutOfMemory,
DeviceError,
ShapeMismatch,
InvalidDType,
InvalidDevice,
BackendNotSupported,
};
Example:
const output = model.forward(input) catch |err| {
std.debug.print("Forward pass failed: {}\n", .{err});
return err;
};
Ztorch Operations Catalog
Complete specification of all operations in Ztorch.
Each operation includes:
- Mathematical definition
- Forward pass algorithm
- Backward pass (gradient) algorithm
- Implementation status
- Test requirements
- Performance characteristics
Matrix Operations
MatMul
Status: ✅ Reference CPU
Matrix multiplication: C = A @ B
Shapes:
A: (M, K), B: (K, N), C: (M, N)
Forward:
C[i,j] = sum_k(A[i,k] * B[k,j])
Backward:
d_A[i,k] = sum_j(d_C[i,j] * B[k,j])
d_B[k,j] = sum_i(A[i,k] * d_C[i,j])
Simplified:
d_A = d_C @ B.T
d_B = A.T @ d_C
FLOPs: 2 * M * K * N
Memory: (M*K + K*N + M*N) * sizeof(dtype)
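A reference CPU-scalar backward that follows the simplified form above. This is an illustrative sketch; the real kernel lives behind the backend interface.
/// d_A = d_C @ B^T and d_B = A^T @ d_C for A(M,K), B(K,N), d_C(M,N).
/// All slices are row-major and pre-allocated; d_a and d_b are overwritten.
fn matmulBackwardScalar(
    m: usize,
    k: usize,
    n: usize,
    a: []const f32,
    b: []const f32,
    d_c: []const f32,
    d_a: []f32,
    d_b: []f32,
) void {
    @memset(d_a, 0);
    @memset(d_b, 0);
    for (0..m) |i| {
        for (0..n) |j| {
            const g = d_c[i * n + j];
            for (0..k) |kk| {
                d_a[i * k + kk] += g * b[kk * n + j]; // d_C @ B^T
                d_b[kk * n + j] += a[i * k + kk] * g; // A^T @ d_C
            }
        }
    }
}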
Tests Required:
- Identity matrix
- Known result (2x2, 3x3)
- Large matrices (1024x1024)
- Non-square matrices
- Gradient check
Backend Status:
- CPU Scalar: ✅ Implemented (matmul_cpu_scalar)
- CPU SIMD: ⏳
- CUDA: ⏳
Benchmarks: bench/ops/matmul.zig (included in zig build bench)
Transpose
Status: ✅ Reference CPU
Matrix transpose: B = A.T
Shapes:
A: (M, N), B: (N, M)
Forward:
B[i,j] = A[j,i]
Backward:
d_A = d_B.T
Implementation Note: Can be a view (no data copy) with stride adjustment.
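To make that concrete, a strided 2-D view can transpose by swapping dims and strides with no data movement. This is a minimal sketch with a local View2D struct, not the Ztorch Tensor type.
const std = @import("std");

// Minimal 2-D strided view, for illustration only.
const View2D = struct {
    data: []const f32,
    rows: usize,
    cols: usize,
    row_stride: usize,
    col_stride: usize,

    fn at(self: View2D, i: usize, j: usize) f32 {
        return self.data[i * self.row_stride + j * self.col_stride];
    }

    /// B = A.T as a view: swap the dims and the strides.
    fn transpose(self: View2D) View2D {
        return .{
            .data = self.data,
            .rows = self.cols,
            .cols = self.rows,
            .row_stride = self.col_stride,
            .col_stride = self.row_stride,
        };
    }
};

test "transpose view reads A[j,i]" {
    const data = [_]f32{ 1, 2, 3, 4, 5, 6 }; // 2x3, row-major
    const a = View2D{ .data = &data, .rows = 2, .cols = 3, .row_stride = 3, .col_stride = 1 };
    const b = a.transpose(); // 3x2 view over the same buffer
    try std.testing.expectEqual(@as(f32, 2), b.at(1, 0)); // B[1,0] == A[0,1]
}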
Activations
ReLU
Status: ✅ Reference CPU
Rectified Linear Unit: y = max(0, x)
Forward:
y[i] = max(0, x[i])
Backward:
d_x[i] = d_y[i] * (x[i] > 0 ? 1 : 0)
FLOPs: N (comparisons)
Tests Required:
- All positive input
- All negative input
- Mixed positive/negative
- Zero values
- Gradient check
Backend Status:
- CPU Scalar: ✅ Implemented (relu_cpu_scalar, relu_backward_cpu_scalar)
- CPU SIMD: ⏳
- CUDA: ⏳
Benchmarks: bench/ops/activations.zig (included in zig build bench)
PTX Implementation (reference):
// y = max(0, x)
max.f32 %out, %in, 0.0
GELU
Status: ⏳ Planned
Gaussian Error Linear Unit (approximation):
y = 0.5 * x * (1 + tanh(sqrt(2/ฯ) * (x + 0.044715 * x^3)))
Backward: (complex, see implementation)
FLOPs: ~10N (approximate)
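A scalar sketch of the tanh approximation above, forward only (not the planned Ztorch kernel):
const std = @import("std");

fn geluScalar(x: f32) f32 {
    const c: f32 = @sqrt(2.0 / std.math.pi); // sqrt(2/pi) ~ 0.7978845608
    const inner = c * (x + 0.044715 * x * x * x);
    return 0.5 * x * (1.0 + std.math.tanh(inner));
}

test "gelu approximation sanity" {
    try std.testing.expectApproxEqAbs(@as(f32, 0.0), geluScalar(0.0), 1e-6);
    // Large positive inputs pass through almost unchanged.
    try std.testing.expectApproxEqAbs(@as(f32, 5.0), geluScalar(5.0), 1e-3);
}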
Softmax
Status: ✅ Reference CPU
Softmax over dimension d:
y[i] = exp(x[i] - max(x)) / sum_j(exp(x[j] - max(x)))
Why subtract max: Numerical stability (prevent overflow)
Forward Algorithm:
1. max_val = max(x)
2. exp_vals[i] = exp(x[i] - max_val)
3. sum_exp = sum(exp_vals)
4. y[i] = exp_vals[i] / sum_exp
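The four steps above as a scalar sketch over a single row (illustrative; the shipped implementation is softmax_cpu_scalar):
const std = @import("std");

fn softmaxRow(x: []const f32, y: []f32) void {
    // 1. max for numerical stability
    var max_val = x[0];
    for (x[1..]) |v| max_val = @max(max_val, v);
    // 2-3. exponentiate and accumulate the sum
    var sum: f32 = 0;
    for (x, y) |v, *out| {
        out.* = @exp(v - max_val);
        sum += out.*;
    }
    // 4. normalize
    for (y) |*out| out.* /= sum;
}

test "softmax sums to 1 even for large logits" {
    const logits = [_]f32{ 1000.0, 1001.0, 999.0 }; // would overflow without the max trick
    var probs: [3]f32 = undefined;
    softmaxRow(&logits, &probs);
    var sum: f32 = 0;
    for (probs) |p| sum += p;
    try std.testing.expectApproxEqAbs(@as(f32, 1.0), sum, 1e-5);
}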
Backward: (complex Jacobian, see implementation)
FLOPs: ~5N
Tests Required:
- Uniform input (all same value)
- One large value (should be ~1.0)
- Output sums to 1.0
- Gradient check
Backend Status:
- CPU Scalar: ✅ Implemented (softmax_cpu_scalar)
- CPU SIMD: ⏳
- CUDA: ⏳
Normalization
LayerNorm
Status: ⏳ Planned
Layer normalization:
y = (x - mean) / sqrt(var + eps) * gamma + beta
Forward Algorithm:
1. mean = sum(x) / N
2. var = sum((x - mean)^2) / N
3. x_norm = (x - mean) / sqrt(var + eps)
4. y = gamma * x_norm + beta
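A scalar sketch of this forward algorithm over one feature vector, with per-element gamma and beta (illustrative only; the operation itself is still planned):
const std = @import("std");

fn layerNormRow(x: []const f32, gamma: []const f32, beta: []const f32, eps: f32, y: []f32) void {
    const n: f32 = @floatFromInt(x.len);
    // 1. mean
    var mean: f32 = 0;
    for (x) |v| mean += v;
    mean /= n;
    // 2. variance (biased, matching the definition above)
    var variance: f32 = 0;
    for (x) |v| variance += (v - mean) * (v - mean);
    variance /= n;
    // 3-4. normalize, then scale and shift
    const inv_std = 1.0 / @sqrt(variance + eps);
    for (x, gamma, beta, y) |v, g, b, *out| {
        out.* = g * (v - mean) * inv_std + b;
    }
}

test "layernorm gives ~zero mean with unit gamma, zero beta" {
    const x = [_]f32{ 1, 2, 3, 4 };
    const gamma = [_]f32{ 1, 1, 1, 1 };
    const beta = [_]f32{ 0, 0, 0, 0 };
    var y: [4]f32 = undefined;
    layerNormRow(&x, &gamma, &beta, 1e-5, &y);
    var mean: f32 = 0;
    for (y) |v| mean += v;
    try std.testing.expectApproxEqAbs(@as(f32, 0.0), mean / 4.0, 1e-5);
}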
Backward: (chain rule through all operations)
FLOPs: ~5N
Parameters:
- gamma: learned scale (shape: normalized_shape)
- beta: learned bias (shape: normalized_shape)
Tests Required:
- Known mean/variance
- Learnable parameters
- Gradient check
BatchNorm
Status: ⏳ Planned (v0.2)
Batch normalization (more complex due to training/inference modes)
Loss Functions
CrossEntropy
Status: ✅ Reference CPU
Cross-entropy loss over class logits, computed with numerical stabilization:
logsumexp = max(logits) + log(sum(exp(logits - max(logits))))
loss = mean(logsumexp - logits[range, target])
Backward:
d_logits = (softmax(logits) - one_hot(target)) / batch_size
Backend Status:
- CPU Scalar: ✅ Implemented (cross_entropy_cpu_scalar)
- CPU SIMD: ⏳
- CUDA: ⏳
Tests: tests/ops/loss.zig
Forward Algorithm:
1. Apply softmax to predictions
2. loss = -log(probs[target_class])
Backward:
d_pred[i] = probs[i] - (i == target ? 1 : 0)
Tests Required:
- Perfect prediction (loss ~ 0)
- Random prediction (loss ~ log(num_classes))
- Gradient check
MSE
Status: ⏳ Planned
Mean squared error:
loss = mean((pred - target)^2)
Forward:
loss = sum((pred[i] - target[i])^2) / N
Backward:
d_pred[i] = 2 * (pred[i] - target[i]) / N
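Forward and backward as a scalar sketch (the op is still planned; this is just the math above written in Zig):
const std = @import("std");

fn mseForward(pred: []const f32, target: []const f32) f32 {
    var sum: f32 = 0;
    for (pred, target) |p, t| sum += (p - t) * (p - t);
    return sum / @as(f32, @floatFromInt(pred.len));
}

fn mseBackward(pred: []const f32, target: []const f32, d_pred: []f32) void {
    const n: f32 = @floatFromInt(pred.len);
    for (pred, target, d_pred) |p, t, *d| d.* = 2.0 * (p - t) / n;
}

test "mse of identical tensors is zero" {
    const a = [_]f32{ 1, 2, 3 };
    try std.testing.expectApproxEqAbs(@as(f32, 0.0), mseForward(&a, &a), 1e-7);
}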
Element-wise Operations
Add
z = x + y
Shapes: Broadcasting supported
Forward: z[i] = x[i] + y[i]
Backward: d_x = d_z, d_y = d_z
Multiply
z = x * y
Forward: z[i] = x[i] * y[i]
Backward: d_x = d_z * y, d_y = d_z * x
Exp
y = exp(x)
Forward: y[i] = exp(x[i])
Backward: d_x[i] = d_y[i] * y[i]
Log
y = log(x)
Forward: y[i] = log(x[i])
Backward: d_x[i] = d_y[i] / x[i]
Reduction Operations
Sum
Sum over dimension(s)
Forward: y = sum(x, dim)
Backward: Broadcast d_y to shape of x
Mean
Mean over dimension(s)
Forward: y = sum(x, dim) / N
Backward: Broadcast d_y / N to shape of x
Max
Max over dimension(s)
Forward: y = max(x, dim)
Backward: Gradient flows only to max element (argmax mask)
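The argmax-mask rule in sketch form, reducing a vector to a single scalar max (hypothetical helper; ties go to the first maximal element):
/// y = max(x); d_x[i] = d_y if i == argmax(x), else 0.
fn maxBackward(x: []const f32, d_y: f32, d_x: []f32) void {
    var argmax: usize = 0;
    for (x, 0..) |v, i| {
        if (v > x[argmax]) argmax = i;
    }
    for (d_x, 0..) |*d, i| {
        d.* = if (i == argmax) d_y else 0;
    }
}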
Testing Requirements
Every operation must have:
- Unit tests with known values
test "matmul: 2x2 known result" {
// [[1,2],[3,4]] @ [[5,6],[7,8]] = [[19,22],[43,50]]
}
- Gradient checks
test "matmul: gradient check" {
// Compare autograd gradient vs numerical gradient
}
- Backend parity tests
test "matmul: cpu vs cuda" {
// Verify GPU output matches CPU (within epsilon)
}
- Benchmarks
bench "matmul: 1024x1024" {
// Measure GFLOPS
}
Optimizers
SGD
Status: ✅ Reference CPU
Stochastic Gradient Descent with constant learning rate.
Update Rule:
param -= lr * grad
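The update rule for a flat parameter slice, as a minimal sketch without momentum or weight decay (the actual code lives in optim/sgd.zig):
fn sgdStep(params: []f32, grads: []const f32, lr: f32) void {
    for (params, grads) |*p, g| {
        p.* -= lr * g; // param -= lr * grad
    }
}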
Backend Status:
- CPU Scalar: ✅ (optim/sgd.zig)
- CPU SIMD: ⏳
- CUDA: ⏳
Tests: tests/integration/xor.zig
Implementation Checklist
For each operation:
- Mathematical definition documented
- Forward pass implemented (CPU scalar)
- Backward pass implemented (CPU scalar)
- Unit tests with known values
- Gradient check tests
- Benchmarked (baseline)
- CPU SIMD implementation
- CUDA implementation
- Backend parity tests
- Performance validated (vs napkin math)
Future Operations (v0.2+)
- Conv2D
- MaxPool2D
- Dropout
- Embedding
- Attention (fused)
- RMSNorm
Ztorch Backends
Backend implementation details and architecture.
Overview
Ztorch supports multiple backends, each optimized for different hardware:
- CPU Scalar - Reference implementation, obviously correct
- CPU SIMD - Vectorized with AVX2/AVX512 (x86) or NEON (ARM)
- CUDA - NVIDIA GPUs via PTX assembly
- ROCm - AMD GPUs via LLVM IR
- Vulkan - Cross-vendor via SPIR-V
Backend Interface
All backends implement the same interface:
pub const Backend = struct {
vtable: *const VTable,
context: *anyopaque,
pub const VTable = struct {
// Lifecycle
init: *const fn (Allocator) anyerror!*anyopaque,
deinit: *const fn (*anyopaque) void,
// Memory
alloc: *const fn (*anyopaque, usize) anyerror!DevicePtr,
free: *const fn (*anyopaque, DevicePtr) void,
copy_to_device: *const fn (*anyopaque, []const u8, DevicePtr) anyerror!void,
copy_from_device: *const fn (*anyopaque, DevicePtr, []u8) anyerror!void,
// Operations
matmul: *const fn (*anyopaque, Tensor, Tensor) anyerror!Tensor,
relu: *const fn (*anyopaque, Tensor) anyerror!Tensor,
softmax: *const fn (*anyopaque, Tensor, usize) anyerror!Tensor,
// ... all ops
};
};
CPU Scalar Backend
Purpose: Reference implementation for correctness.
Characteristics:
- Simple, readable code
- No SIMD, no assembly
- Serves as ground truth for testing
Example: MatMul
fn matmul_cpu_scalar(a: Tensor, b: Tensor, c: *Tensor) void {
const M = a.shape.dims[0];
const K = a.shape.dims[1];
const N = b.shape.dims[1];
for (0..M) |i| {
for (0..N) |j| {
var sum: f32 = 0;
for (0..K) |k| {
sum += a.data[i * K + k] * b.data[k * N + j];
}
c.data[i * N + j] = sum;
}
}
}
Expected Performance:
- MatMul 1024x1024: ~5 GFLOPS (single core)
CPU SIMD Backend
Purpose: Optimized CPU implementation using vector instructions.
Implementation:
- x86: AVX2 (256-bit) or AVX-512 (512-bit) intrinsics
- ARM: NEON (128-bit) intrinsics
- Runtime detection of CPU capabilities
- Fallback to scalar if unsupported
Example: ReLU with AVX2
fn relu_cpu_avx2(input: []f32, output: []f32) void {
const zero = @Vector(8, f32){0, 0, 0, 0, 0, 0, 0, 0};
var i: usize = 0;
while (i + 8 <= input.len) : (i += 8) {
const vec: @Vector(8, f32) = input[i..][0..8].*;
const result = @max(vec, zero);
output[i..][0..8].* = result;
}
// Handle remaining elements
while (i < input.len) : (i += 1) {
output[i] = @max(input[i], 0);
}
}
Expected Performance:
- 4-8x speedup over scalar (depending on operation)
- MatMul 1024x1024: ~20-40 GFLOPS (single core)
CUDA Backend
Purpose: High-performance execution on NVIDIA GPUs.
Architecture:
- Generate PTX assembly at comptime
- Load via CUDA Driver API
- Use tensor cores when available
Code Generation:
fn generateMatMulPTX(
comptime M: usize,
comptime N: usize,
comptime K: usize,
) [:0]const u8 {
comptime {
const tile_size = 32;
var ptx: []const u8 = ".version 8.5\n";
ptx = ptx ++ ".target sm_80\n"; // Ampere
ptx = ptx ++ ".address_size 64\n\n";
// Generate tiled matmul kernel
// Use wmma.mma for tensor cores
// ...
return ptx;
}
}
Optimization Techniques:
- Tiling for shared memory
- Tensor core usage (wmma instructions)
- Memory coalescing
- Bank conflict avoidance
Expected Performance:
- RTX 4090: 4000+ GFLOPS for MatMul
- Limited by memory bandwidth for smaller ops
Tensor Cores:
// Load into matrix fragments
wmma.load.sync.aligned.m16n16k16.global.f32 %frag_a, [%ptr_a];
wmma.load.sync.aligned.m16n16k16.global.f32 %frag_b, [%ptr_b];
// Multiply-accumulate
wmma.mma.sync.aligned.m16n16k16.f32.f32 %frag_c, %frag_a, %frag_b, %frag_c;
// Store result
wmma.store.sync.aligned.m16n16k16.global.f32 [%ptr_c], %frag_c;
ROCm Backend
Purpose: Support for AMD GPUs.
Architecture:
- Generate LLVM IR at comptime
- Compile via ROCm HIP toolchain
- Similar optimization strategies to CUDA
Code Generation:
fn generateMatMulLLVM(
comptime M: usize,
comptime N: usize,
comptime K: usize,
) [:0]const u8 {
comptime {
var llvm: []const u8 = "";
// LLVM IR for tiled matmul
llvm = llvm ++ "define void @matmul(...) {\n";
// ...
llvm = llvm ++ "}\n";
return llvm;
}
}
Expected Performance:
- Similar to CUDA (depends on GPU model)
- MI300X: 5000+ GFLOPS for MatMul
Vulkan Backend
Purpose: Cross-vendor GPU support (NVIDIA, AMD, Intel).
Architecture:
- Generate SPIR-V at comptime
- Use Vulkan compute shaders
- Portable across all GPUs
Trade-offs:
- More portable but less optimized than native backends
- No tensor core access (uses FP32 MAD operations)
- Good for inference, adequate for training
Code Generation:
fn generateMatMulSPIRV(
comptime M: usize,
comptime N: usize,
comptime K: usize,
) []const u8 {
comptime {
// SPIR-V assembly
var spirv: []const u8 = "";
// OpTypeFloat, OpTypeVector, etc.
// OpMatrixTimesVector or manual loop
// ...
return spirv;
}
}
Expected Performance:
- 50-80% of native backend performance
- Good enough for most inference workloads
Backend Selection
Compile-time
const Model = ztorch.Sequential(.{ /* ... */ });
var model = try Model.compile(.cuda, allocator); // Fixed at compile time
Runtime
const backend: Device = if (cuda_available) .cuda else .cpu;
var model = try Model.compile(backend, allocator);
Auto-detection
const backend = try ztorch.selectBestBackend(); // Detects available hardware
var model = try Model.compile(backend, allocator);
Testing Backend Parity
All backends must produce identical results (within floating-point precision).
test "matmul: cpu vs cuda parity" {
const input_a = try Tensor.randn(.{32, 64}, .f32, .cpu);
const input_b = try Tensor.randn(.{64, 128}, .f32, .cpu);
// CPU result
const output_cpu = try ops.matmul_cpu(input_a, input_b);
// CUDA result
const input_a_gpu = try input_a.to(.cuda);
const input_b_gpu = try input_b.to(.cuda);
const output_cuda = try ops.matmul_cuda(input_a_gpu, input_b_gpu);
const output_cuda_cpu = try output_cuda.to(.cpu);
// Compare
for (output_cpu.data, output_cuda_cpu.data) |cpu_val, cuda_val| {
try testing.expectApproxEqAbs(cpu_val, cuda_val, 1e-4); // epsilon
}
}
Performance Validation
Every backend implementation must meet minimum performance requirements.
Napkin Math Target:
- CPU Scalar: Baseline
- CPU SIMD: >2x scalar
- CUDA: >10x CPU for N>1024
- ROCm: Similar to CUDA
- Vulkan: >5x CPU
Example Validation:
bench "matmul: 1024x1024 performance check" {
const result = try benchMatMul(.cuda, 1024, 1024, 1024);
// RTX 4090 theoretical: 82 TFLOPS
// 2*1024^3 FLOPs = 2.1B FLOPs
// Minimum 50% efficiency: 41 TFLOPS = 51 µs
try testing.expect(result.elapsed_ns < 51_000);
}
Future Backends
- Metal (Apple Silicon) - v0.3
- WebGPU (browser) - v0.4
- CPU Multi-threaded - v0.2 (OpenMP-style)
Ztorch Testing Strategy
Comprehensive testing approach following Tiger Style principles.
Testing Philosophy
- TDD from day 0 - Write tests before implementation
- Test all platforms - Linux, macOS, Windows on x86_64 and aarch64
- Reference-based verification - CPU scalar is ground truth
- No untested code - Every line exercised by tests
- Fail fast - Assertions in code, strict validation in tests
Test Categories
1. Unit Tests
Test individual operations with known inputs/outputs.
Location: test/ops/
Example:
// test/ops/matmul_test.zig
const std = @import("std");
const ztorch = @import("ztorch");
const testing = std.testing;
test "matmul: 2x2 identity matrix" {
// I @ A = A
const identity = [_]f32{ 1, 0, 0, 1 };
const matrix = [_]f32{ 5, 6, 7, 8 };
var result: [4]f32 = undefined;
ztorch.ops.matmul_cpu(.{2, 2}, &identity, &matrix, &result);
try testing.expectEqual(@as(f32, 5), result[0]);
try testing.expectEqual(@as(f32, 6), result[1]);
try testing.expectEqual(@as(f32, 7), result[2]);
try testing.expectEqual(@as(f32, 8), result[3]);
}
test "matmul: 2x2 known result" {
// [[1, 2], [3, 4]] @ [[5, 6], [7, 8]]
// = [[1*5+2*7, 1*6+2*8], [3*5+4*7, 3*6+4*8]]
// = [[19, 22], [43, 50]]
const a = [_]f32{ 1, 2, 3, 4 };
const b = [_]f32{ 5, 6, 7, 8 };
var c: [4]f32 = undefined;
ztorch.ops.matmul_cpu(.{2, 2}, &a, &b, &c);
const epsilon = 1e-5;
try testing.expectApproxEqAbs(@as(f32, 19), c[0], epsilon);
try testing.expectApproxEqAbs(@as(f32, 22), c[1], epsilon);
try testing.expectApproxEqAbs(@as(f32, 43), c[2], epsilon);
try testing.expectApproxEqAbs(@as(f32, 50), c[3], epsilon);
}
test "matmul: non-square matrices" {
// (2, 3) @ (3, 4) = (2, 4)
const a = [_]f32{ 1, 2, 3, 4, 5, 6 };
const b = [_]f32{ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 };
var c: [8]f32 = undefined;
ztorch.ops.matmul_cpu(.{2, 3, 4}, &a, &b, &c);
// Verify all elements (computed manually)
const expected = [_]f32{ 38, 44, 50, 56, 83, 98, 113, 128 };
const epsilon = 1e-5;
for (expected, c) |exp, act| {
try testing.expectApproxEqAbs(exp, act, epsilon);
}
}
test "matmul: large matrix stress test" {
const allocator = testing.allocator;
const n = 1024;
var a = try allocator.alloc(f32, n * n);
defer allocator.free(a);
var b = try allocator.alloc(f32, n * n);
defer allocator.free(b);
var c = try allocator.alloc(f32, n * n);
defer allocator.free(c);
// Fill with known pattern
for (0..n*n) |i| {
a[i] = @floatFromInt(i % 100);
b[i] = @floatFromInt((i * 7) % 100);
}
// Should complete without error
ztorch.ops.matmul_cpu(.{n, n}, a, b, c);
// Spot check a few values (not full verification)
try testing.expect(c[0] != 0);
try testing.expect(!std.math.isNan(c[n*n - 1]));
}
Requirements:
- At least 3 tests per operation (simple, known result, edge cases)
- Cover edge cases (zeros, negatives, large values, etc.)
- Test various sizes (small, medium, large)
2. Gradient Check Tests
Verify autograd correctness using numerical differentiation.
Location: test/autograd/
Example:
// test/autograd/gradient_check_test.zig
const std = @import("std");
const ztorch = @import("ztorch");
const testing = std.testing;
fn numericalGradient(
comptime f: anytype,
x: []f32,
epsilon: f32,
) ![]f32 {
const allocator = testing.allocator;
var grad = try allocator.alloc(f32, x.len);
for (0..x.len) |i| {
// f(x + h)
x[i] += epsilon;
const f_plus = try f(x);
// f(x - h)
x[i] -= 2 * epsilon;
const f_minus = try f(x);
// (f(x+h) - f(x-h)) / 2h
grad[i] = (f_plus - f_minus) / (2 * epsilon);
// Restore
x[i] += epsilon;
}
return grad;
}
test "matmul: gradient check" {
const allocator = testing.allocator;
// Small matrices for numerical stability
const a = [_]f32{ 1, 2, 3, 4 };
const b = [_]f32{ 5, 6, 7, 8 };
// Forward pass
var c: [4]f32 = undefined;
ztorch.ops.matmul_cpu(.{2, 2}, &a, &b, &c);
// Backward pass (autograd)
const d_c = [_]f32{ 1, 1, 1, 1 }; // Gradient of loss
var d_a: [4]f32 = undefined;
var d_b: [4]f32 = undefined;
ztorch.ops.matmul_backward_cpu(.{2, 2}, &d_c, &a, &b, &d_a, &d_b);
// Numerical gradient
var a_copy = a;
const num_grad_a = try numericalGradient(
struct {
fn f(x: []f32) !f32 {
var tmp: [4]f32 = undefined;
ztorch.ops.matmul_cpu(.{2, 2}, x, &b, &tmp);
return tmp[0] + tmp[1] + tmp[2] + tmp[3]; // sum
}
}.f,
&a_copy,
1e-4,
);
defer allocator.free(num_grad_a);
// Compare autograd vs numerical
const epsilon = 1e-3; // Numerical gradients are approximate
for (d_a, num_grad_a) |auto_grad, num_grad| {
try testing.expectApproxEqAbs(auto_grad, num_grad, epsilon);
}
}
test "relu: gradient check" {
const input = [_]f32{ -2, -1, 0, 1, 2 };
var output: [5]f32 = undefined;
// Forward
ztorch.ops.relu_cpu(&input, &output);
// Backward
const d_output = [_]f32{ 1, 1, 1, 1, 1 };
var d_input: [5]f32 = undefined;
ztorch.ops.relu_backward_cpu(&d_output, &input, &d_input);
// Expected gradients: ReLU'(x) = x > 0 ? 1 : 0
const expected = [_]f32{ 0, 0, 0, 1, 1 };
for (expected, d_input) |exp, act| {
try testing.expectEqual(exp, act);
}
}
Requirements:
- Every differentiable operation must have gradient check
- Use numerical differentiation as reference
- Epsilon tolerance based on operation complexity
3. Backend Parity Tests
Verify all backends produce identical results.
Location: test/backends/
Example:
// test/backends/parity_test.zig
const std = @import("std");
const ztorch = @import("ztorch");
const testing = std.testing;
test "matmul: cpu scalar vs cpu simd" {
if (!ztorch.cpu.hasSimd()) return error.SkipZigTest;
const allocator = testing.allocator;
const n = 256;
// Random input
var a = try allocator.alloc(f32, n * n);
defer allocator.free(a);
var b = try allocator.alloc(f32, n * n);
defer allocator.free(b);
var rng = std.Random.DefaultPrng.init(42);
for (0..n*n) |i| {
a[i] = rng.random().float(f32) * 2 - 1; // [-1, 1]
b[i] = rng.random().float(f32) * 2 - 1;
}
// CPU scalar
var c_scalar = try allocator.alloc(f32, n * n);
defer allocator.free(c_scalar);
ztorch.ops.matmul_cpu_scalar(.{n, n}, a, b, c_scalar);
// CPU SIMD
var c_simd = try allocator.alloc(f32, n * n);
defer allocator.free(c_simd);
ztorch.ops.matmul_cpu_simd(.{n, n}, a, b, c_simd);
// Compare
const epsilon = 1e-4; // Allow small numerical differences
for (c_scalar, c_simd) |scalar, simd| {
try testing.expectApproxEqAbs(scalar, simd, epsilon);
}
}
test "matmul: cpu vs cuda" {
if (!ztorch.cuda.isAvailable()) return error.SkipZigTest;
const allocator = testing.allocator;
const n = 1024;
// Random input
var a = try allocator.alloc(f32, n * n);
defer allocator.free(a);
var b = try allocator.alloc(f32, n * n);
defer allocator.free(b);
var rng = std.Random.DefaultPrng.init(42);
for (0..n*n) |i| {
a[i] = rng.random().float(f32) * 2 - 1;
b[i] = rng.random().float(f32) * 2 - 1;
}
// CPU result
var c_cpu = try allocator.alloc(f32, n * n);
defer allocator.free(c_cpu);
ztorch.ops.matmul_cpu(.{n, n}, a, b, c_cpu);
// CUDA result
const a_gpu = try ztorch.cuda.allocAndCopy(a);
defer ztorch.cuda.free(a_gpu);
const b_gpu = try ztorch.cuda.allocAndCopy(b);
defer ztorch.cuda.free(b_gpu);
const c_gpu = try ztorch.cuda.alloc(n * n * @sizeOf(f32));
defer ztorch.cuda.free(c_gpu);
try ztorch.ops.matmul_cuda(.{n, n}, a_gpu, b_gpu, c_gpu);
var c_cuda = try allocator.alloc(f32, n * n);
defer allocator.free(c_cuda);
try ztorch.cuda.copyToHost(c_gpu, c_cuda);
// Compare
const epsilon = 1e-3; // GPU may have slightly different rounding
var max_diff: f32 = 0;
for (c_cpu, c_cuda) |cpu, cuda| {
const diff = @abs(cpu - cuda);
max_diff = @max(max_diff, diff);
try testing.expectApproxEqAbs(cpu, cuda, epsilon);
}
std.debug.print("Max difference: {d:.6}\n", .{max_diff});
}
Requirements:
- Test all backend combinations
- Use random inputs to catch edge cases
- Report maximum difference for debugging
4. Integration Tests
Test complete workflows (model training, inference).
Location: test/integration/
Example:
// test/integration/mnist_test.zig
const std = @import("std");
const ztorch = @import("ztorch");
const testing = std.testing;
test "integration: train simple MLP on synthetic data" {
const allocator = testing.allocator;
// Define model
const Model = ztorch.Sequential(.{
ztorch.Linear(10, 20),
ztorch.ReLU(),
ztorch.Linear(20, 2),
});
var model = try Model.compile(.cpu, allocator);
defer model.deinit();
// Synthetic data (linearly separable)
const batch_size = 32;
var input = try ztorch.Tensor.zeros(.{batch_size, 10}, .f32, .cpu);
defer input.deinit();
var labels = try ztorch.Tensor.zeros(.{batch_size}, .i32, .cpu);
defer labels.deinit();
// Fill with pattern
for (0..batch_size) |i| {
const label: i32 = if (i < batch_size / 2) 0 else 1;
labels.data[i] = label;
for (0..10) |j| {
input.data[i * 10 + j] = if (label == 0) -1.0 else 1.0;
}
}
// Train for a few steps
var initial_loss: f32 = 0;
var final_loss: f32 = 0;
for (0..100) |step| {
const output = try model.forward(input);
defer output.deinit();
const loss = try ztorch.crossEntropy(output, labels);
defer loss.deinit();
if (step == 0) initial_loss = loss.item();
if (step == 99) final_loss = loss.item();
try model.backward(loss);
try model.step(.{ .sgd = .{ .lr = 0.01 } });
}
// Loss should decrease
try testing.expect(final_loss < initial_loss);
// Should converge to low loss on this simple problem
try testing.expect(final_loss < 0.1);
}
test "integration: save and load model" {
// TODO: Implement serialization
return error.SkipZigTest;
}
Requirements:
- Test complete training loops
- Verify loss decreases
- Test inference
- Test model serialization (future)
5. Property-Based Tests
Test properties that should hold for all inputs.
Example:
test "property: matmul associativity" {
// (A @ B) @ C = A @ (B @ C)
const allocator = testing.allocator;
var rng = std.Random.DefaultPrng.init(42);
// Small matrices for speed
const n = 16;
var a = try allocator.alloc(f32, n * n);
defer allocator.free(a);
var b = try allocator.alloc(f32, n * n);
defer allocator.free(b);
var c = try allocator.alloc(f32, n * n);
defer allocator.free(c);
// Random values
for (0..n*n) |i| {
a[i] = rng.random().float(f32);
b[i] = rng.random().float(f32);
c[i] = rng.random().float(f32);
}
// (A @ B) @ C
var ab = try allocator.alloc(f32, n * n);
defer allocator.free(ab);
ztorch.ops.matmul_cpu(.{n, n}, a, b, ab);
var abc_left = try allocator.alloc(f32, n * n);
defer allocator.free(abc_left);
ztorch.ops.matmul_cpu(.{n, n}, ab, c, abc_left);
// A @ (B @ C)
var bc = try allocator.alloc(f32, n * n);
defer allocator.free(bc);
ztorch.ops.matmul_cpu(.{n, n}, b, c, bc);
var abc_right = try allocator.alloc(f32, n * n);
defer allocator.free(abc_right);
ztorch.ops.matmul_cpu(.{n, n}, a, bc, abc_right);
// Should be approximately equal
const epsilon = 1e-3; // Numerical error accumulates
for (abc_left, abc_right) |left, right| {
try testing.expectApproxEqAbs(left, right, epsilon);
}
}
test "property: softmax sums to 1" {
const allocator = testing.allocator;
var rng = std.Random.DefaultPrng.init(42);
const n = 100;
var input = try allocator.alloc(f32, n);
defer allocator.free(input);
var output = try allocator.alloc(f32, n);
defer allocator.free(output);
for (0..10) |_| {
// Random input
for (0..n) |i| {
input[i] = rng.random().float(f32) * 20 - 10; // [-10, 10]
}
ztorch.ops.softmax_cpu(input, output);
// Sum should be 1
var sum: f32 = 0;
for (output) |val| sum += val;
try testing.expectApproxEqAbs(@as(f32, 1.0), sum, 1e-5);
}
}
CI Configuration
.github/workflows/ci.yml:
name: CI
on:
push:
branches: [main, dev]
pull_request:
branches: [main]
jobs:
test:
strategy:
fail-fast: false
matrix:
os: [ubuntu-latest, macos-latest, windows-latest]
arch: [x86_64, aarch64]
exclude:
- os: windows-latest
arch: aarch64
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v4
- name: Setup Zig
uses: goto-bus-stop/setup-zig@v2
with:
version: master
- name: Check formatting
run: zig fmt --check .
- name: Build
run: zig build --summary all
- name: Run unit tests
run: zig build test --summary all
- name: Run integration tests
run: zig build test-integration --summary all
- name: Run benchmarks (smoke test)
run: zig build bench --summary all
env:
BENCH_ITERATIONS: 10 # Quick smoke test
test-cuda:
runs-on: ubuntu-latest
# Requires self-hosted runner with GPU
# if: github.event_name == 'push'
steps:
- uses: actions/checkout@v4
- name: Setup Zig
uses: goto-bus-stop/setup-zig@v2
with:
version: master
- name: Setup CUDA
uses: Jimver/cuda-toolkit@v0.2.11
with:
cuda: "12.1.0"
- name: Run CUDA tests
run: zig build test-cuda --summary all
- name: Run backend parity tests
run: zig build test-parity --summary all
coverage:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Zig
uses: goto-bus-stop/setup-zig@v2
with:
version: master
- name: Run tests with coverage
run: zig build test -Dcoverage
- name: Upload coverage
uses: codecov/codecov-action@v3
with:
files: ./zig-out/coverage.txt
Test Execution
# Run all tests
zig build test
# Run specific test file
zig build test -- test/ops/matmul_test.zig
# Run with verbose output
zig build test --summary all
# Run benchmarks
zig build bench
# Run only fast tests (no GPU, no integration)
zig build test-fast
# Run GPU tests only
zig build test-cuda
# Run backend parity tests
zig build test-parity
Test Coverage Requirements
- Minimum: 90% line coverage
- Target: 95%+ line coverage
- Every public API must be tested
- Every backend implementation must be tested
Continuous Benchmarking
Track performance over time to catch regressions.
# Run benchmarks and save results
zig build bench --output bench-results.json
# Compare against baseline
zig build bench-compare --baseline main
Example output:
=== Benchmark Comparison ===
MatMul 1024x1024:
main: 12.3 µs (baseline)
current: 11.8 µs (4.1% faster ✅)
ReLU 1M elements:
main: 0.5 ms (baseline)
current: 0.6 ms (20% slower ❌) <-- REGRESSION!
Test Writing Guidelines
- Name tests clearly
test "matmul: 2x2 identity matrix" // ✅ Good
test "test1" // ❌ Bad
- One concept per test
  - Test identity matrix separately from known result
  - Makes failures easier to diagnose
- Use descriptive assertions
try testing.expectEqual(@as(f32, 19), result[0]); // ✅ Shows expected
try testing.expect(result[0] == 19); // ❌ Less clear
- Clean up resources
var tensor = try Tensor.zeros(.{100}, .f32, .cpu);
defer tensor.deinit(); // Always cleanup
- Document complex setups
// Testing matmul with non-square matrices
// A: (2, 3), B: (3, 4), Expected C: (2, 4)
Performance Testing
See benchmarking.md for full details on performance testing.
Ztorch Benchmarking
Performance measurement and validation strategy.
Philosophy
- Napkin math first - Estimate before measuring
- Prove every gain - Optimizations must show measurable improvement
- Track regressions - Continuous benchmarking catches slowdowns
- Document results - Publish benchmarks for transparency
Napkin Math
Before implementing any operation, estimate its performance.
Example: MatMul
Operation: C = A @ B
Shapes: A(M, K), B(K, N), C(M, N)
Example: (1024, 1024) @ (1024, 1024)
FLOPs:
2 * M * K * N = 2 * 1024^3 = 2,147,483,648 FLOPs ≈ 2.1 GFLOPs
Memory:
Read: M*K + K*N = 1024*1024 + 1024*1024 = 2*1024^2 = 2,097,152 elements
Write: M*N = 1024*1024 = 1,048,576 elements
Total: 3 * 1024^2 * 4 bytes = 12 MB
GPU: RTX 4090
Peak compute: 82 TFLOPS (FP32)
Peak bandwidth: 1 TB/s
Compute bound if: FLOPs/byte > Peak FLOPs / Peak bandwidth
2.1 GFLOPs / 12 MB = 175 FLOPs/byte
82 TFLOPS / 1 TB/s = 82 FLOPs/byte
175 > 82, so compute bound ✓
Expected time (compute bound):
2.1 GFLOPs / 82 TFLOPS = 25.6 µs
Expected time (memory bound):
12 MB / 1 TB/s = 12 µs
Realistically: ~25-50 µs (accounting for overhead)
CPU: Single core, 5 GHz
Peak (optimistic): ~100 GFLOPS (AVX2)
Realistic: ~20 GFLOPS
Expected time:
2.1 GFLOPs / 20 GFLOPS = 105 ms
The estimate gives us:
- Performance targets
- Expected speedup ratios
- Sanity checks
Benchmark Framework
Implementation
// bench/framework.zig
const std = @import("std");
const ztorch = @import("ztorch");
const duration = ztorch.util.duration;
pub const BenchResult = struct {
name: []const u8,
iterations: usize,
total_ns: u64,
mean_ns: u64,
median_ns: u64,
min_ns: u64,
max_ns: u64,
p99_ns: u64,
pub fn print(self: BenchResult) !void {
std.debug.print("=== Benchmark: {s} ===\n", .{self.name});
std.debug.print("Iterations: {}\n", .{self.iterations});
try printStat("Mean", self.mean_ns);
try printStat("Median", self.median_ns);
try printStat("Min", self.min_ns);
try printStat("Max", self.max_ns);
try printStat("P99", self.p99_ns);
}
};
pub fn benchmark(
comptime name: []const u8,
comptime iterations: usize,
func: anytype,
args: anytype,
) !BenchResult {
var times = try std.heap.page_allocator.alloc(u64, iterations);
defer std.heap.page_allocator.free(times);
const warmup_iterations: usize = if (iterations < 10) iterations else 10;
for (0..warmup_iterations) |_| {
try @call(.auto, func, args);
}
// Measure
for (0..iterations) |i| {
var timer = try std.time.Timer.start();
try @call(.auto, func, args);
times[i] = timer.read();
}
// Sort for median and percentiles
std.sort.pdq(u64, times, {}, comptime std.sort.asc(u64));
// Compute statistics
var total: u64 = 0;
for (times) |t| total += t;
return BenchResult{
.name = name,
.iterations = iterations,
.total_ns = total,
.mean_ns = total / iterations,
.median_ns = times[iterations / 2],
.min_ns = times[0],
.max_ns = times[iterations - 1],
.p99_ns = times[(iterations * 99) / 100],
};
}
fn printStat(label: []const u8, ns: u64) !void {
const text = try duration.formatDuration(std.heap.page_allocator, ns);
defer std.heap.page_allocator.free(text);
std.debug.print("{s}: {s}\n", .{ label, text });
}
The duration formatter is shared with the custom test runner, ensuring benchmarks and tests report timings in a consistent style.
Usage
const std = @import("std");
const framework = @import("../framework.zig");
fn step(allocator: std.mem.Allocator) !void {
// code-under-test goes here
_ = allocator;
}
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer _ = gpa.deinit();
const result = try framework.benchmark(
"demo.step",
100,
step,
.{ gpa.allocator() },
);
try result.print();
}
Operation Benchmarks
MatMul Benchmarks
$ zig build bench
...
=== Benchmark: matmul.cpu_scalar ===
Iterations: 10
Mean: 52.52 ms
Median: 52.49 ms
Min: 52.27 ms
Max: 53.38 ms
P99: 53.38 ms
Size: 256x256, GFLOPS: 0.64
Activation Benchmarks
$ zig build bench
...
=== Benchmark: relu.cpu_scalar.forward/1M_f32 ===
Iterations: 200
Mean: 1.78 ms
Median: 1.79 ms
Min: 1.75 ms
Max: 1.84 ms
P99: 1.83 ms
Elements: 1000000, GFLOPS: 0.56
Comparison Benchmarks
Compare against established libraries.
// bench/comparison/vs_numpy.zig
pub fn benchmarkVsNumPy() !void {
// Generate test data
// Run NumPy matmul (via Python subprocess)
// Run Ztorch matmul
// Compare times
std.debug.print("NumPy: {d:.2} ms\n", .{numpy_time});
std.debug.print("Ztorch: {d:.2} ms\n", .{ztorch_time});
std.debug.print("Speedup: {d:.2}x\n", .{numpy_time / ztorch_time});
}
Continuous Benchmarking
Regression Detection
Run benchmarks on every commit and compare against baseline.
# Save baseline
zig build bench --save-baseline main.json
# Compare current against baseline
zig build bench --compare main.json
# Output
=== Regression Report ===
MatMul 1024x1024:
Baseline: 12.5 ms
Current: 15.3 ms
Change: +22.4% ❌ REGRESSION
ReLU 1M elements:
Baseline: 2.5 ms
Current: 2.4 ms
Change: -4.0% ✅ Improvement
CI Integration
# .github/workflows/bench.yml
name: Benchmark
on:
pull_request:
branches: [main]
jobs:
benchmark:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0 # Need history for comparison
- name: Setup Zig
uses: goto-bus-stop/setup-zig@v2
- name: Download baseline
run: |
gh run download --name benchmark-baseline --repo ${{ github.repository }}
env:
GH_TOKEN: ${{ github.token }}
- name: Run benchmarks
run: zig build bench --compare baseline.json --output results.json
- name: Comment PR
uses: actions/github-script@v6
with:
script: |
const fs = require('fs');
const results = JSON.parse(fs.readFileSync('results.json'));
let comment = '## Benchmark Results\n\n';
for (const result of results) {
const change = ((result.current - result.baseline) / result.baseline * 100).toFixed(1);
const emoji = change > 5 ? '⚠️' : change < -5 ? '✅' : '✔️';
comment += `${emoji} ${result.name}: ${change}%\n`;
}
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: comment
});
Performance Targets
Minimum Requirements (v0.1)
| Operation | Size | CPU Scalar | CPU SIMD | CUDA (RTX 4090) |
|---|---|---|---|---|
| MatMul | 1024ยฒ | 5 GFLOPS | >10 GFLOPS | >1000 GFLOPS |
| ReLU | 1M | 0.4 GFLOPS | >2 GFLOPS | >10 GFLOPS |
| Softmax | 1M | 0.07 GFLOPS | >0.2 GFLOPS | >5 GFLOPS |
Stretch Goals (v0.2)
- Match or exceed PyTorch performance on CPU
- Reach 50%+ of theoretical peak on GPU
- Sub-millisecond inference for small models
Publishing Results
All benchmark results are published in the repository.
benchmarks/results/
results/
├── cpu-scalar/
│   ├── matmul.json
│   ├── relu.json
│   └── ...
├── cpu-avx2/
│   └── ...
├── cuda-rtx4090/
│   └── ...
└── comparisons/
    ├── vs-pytorch.md
    └── vs-numpy.md
Example: benchmarks/results/cpu-scalar/matmul.json
{
"operation": "matmul",
"backend": "cpu-scalar",
"hardware": "Intel i9-13980HX",
"date": "2024-11-06",
"results": [
{
"size": [1024, 1024, 1024],
"iterations": 100,
"mean_ns": 12500000,
"gflops": 0.17
}
]
}
Tools
# Run all benchmarks
zig build bench
# Run specific operation
zig build bench-matmul
zig build bench-activations
# Compare backends
zig build bench-compare-backends
# Generate report
zig build bench-report --output report.md
# Profile (Linux only)
zig build bench-matmul --profile
# Generates flamegraph.svg
Tiger Style for Ztorch
Ztorch follows Tiger Style development philosophy, adapted from TigerBeetle.
Core Principles
1. Safety First
No Undefined Behavior
// ✅ Good: Explicit size
const n: u32 = 1024;
// ❌ Bad: Architecture-dependent
const n: usize = 1024;
Fixed Limits
// ✅ Good: Bounded
const MAX_DIMS = 8;
for (0..@min(ndim, MAX_DIMS)) |i| { ... }
// ❌ Bad: Unbounded
while (has_more_dims()) { ... }
Static Allocation
// ✅ Good: Allocate at init
pub fn init(allocator: Allocator) !Model {
const weights = try allocator.alloc(f32, WEIGHT_SIZE);
return Model{ .weights = weights };
}
// ❌ Bad: Dynamic allocation in hot path
pub fn forward(self: *Model) !Tensor {
const temp = try self.allocator.alloc(f32, runtime_size); // ❌
// ...
}
Explicit Error Handling
// ✅ Good: Handle all errors
const result = matmul(a, b) catch |err| {
log.err("MatMul failed: {}", .{err});
return err;
};
// ❌ Bad: Ignore errors
const result = matmul(a, b) catch unreachable; // ❌
2. Performance
Napkin Math First
// Before implementing, calculate:
// MatMul (1024, 1024, 1024):
// FLOPs: 2 * 1024^3 = 2.1 GFLOPs
// Memory: 12 MB
// Expected time: ~50µs on RTX 4090
//
// Then implement and measure actual vs expected.
Batch Operations
// ✅ Good: Process multiple items
pub fn forward_batch(inputs: []Tensor) ![]Tensor {
// Single kernel launch for all inputs
}
// ❌ Bad: One at a time
for (inputs) |input| {
_ = try forward(input); // Multiple kernel launches
}
Optimize Resources in Order
- Network (if distributed)
- Disk (if I/O bound)
- Memory (usually the bottleneck for ML)
- CPU/GPU (optimize last)
3. Developer Experience
Clear Naming
// ✅ Good: Descriptive
pub fn matmul_cpu_scalar(a: Tensor, b: Tensor) Tensor { ... }
const latency_ms_max: f32 = 100.0;
// ❌ Bad: Abbreviated
pub fn mm(a: T, b: T) T { ... }
const max_lat: f32 = 100.0;
Document the Why
// ✅ Good: Explains reason
// We subtract max before exp() to prevent overflow.
// For input [1000, 1001], exp() would overflow, but
// exp([0, 1]) / sum(exp([0, 1])) gives same result.
const max_val = max(input);
for (input) |val| {
exp(val - max_val);
}
// ❌ Bad: Just describes what
// Subtract max
const max_val = max(input);
Organize Logically
// ✅ Good: Grouped by domain
src/
├── ops/        # Operations
│   ├── matmul.zig
│   ├── relu.zig
│   └── softmax.zig
├── backends/   # Backend implementations
│   ├── cpu.zig
│   └── cuda.zig
└── ir/         # Internal representation
// ❌ Bad: Mixed concerns
src/
├── stuff.zig
├── more_stuff.zig
└── utils.zig
Specific Guidelines for Ztorch
Tensor Operations
Always Check Shapes
pub fn matmul(a: Tensor, b: Tensor) !Tensor {
// Tiger Style: Validate inputs
if (a.shape.dims[a.shape.ndim - 1] != b.shape.dims[0]) {
return error.ShapeMismatch;
}
// ...
}
Use Comptime When Possible
// ✅ Good: Shapes known at compile time
const Model = ztorch.Sequential(.{
ztorch.Linear(784, 128), // comptime validation
ztorch.ReLU(),
ztorch.Linear(128, 10), // 128 matches!
});
// ❌ Bad: Shapes only checked at runtime
var model = Model.init();
model.addLayer(Linear.init(784, 128));
model.addLayer(ReLU.init());
model.addLayer(Linear.init(64, 10)); // Oops, 64 != 128!
Backend Implementation
Reference Implementation First
// Step 1: CPU scalar (obviously correct)
pub fn relu_cpu_scalar(input: []f32, output: []f32) void {
for (input, output) |in, *out| {
out.* = @max(0, in);
}
}
// Step 2: Test thoroughly
test "relu: cpu scalar" { ... }
// Step 3: Optimize (SIMD)
pub fn relu_cpu_simd(input: []f32, output: []f32) void {
// ...
}
// Step 4: Verify against reference
test "relu: cpu simd vs scalar" {
try expectEqualSlices(scalar_result, simd_result);
}
Prove GPU Optimizations
# Before optimization
MatMul 1024x1024: 100µs (21 GFLOPS)
# After tiling optimization
MatMul 1024x1024: 25µs (84 GFLOPS) # 4x speedup ✅
# Document in code:
// Tiled implementation achieves 84 GFLOPS vs 21 GFLOPS naive (4x).
// Uses 32x32 tiles to maximize shared memory reuse.
Error Handling
Fail Fast on Programmer Errors
pub fn forward(self: *Model, input: Tensor) !Tensor {
// Programmer error: wrong input shape
if (input.shape.dims[1] != self.input_size) {
// Tiger Style: This is a bug, not a runtime error
std.debug.panic(
"Input shape mismatch: expected {}, got {}",
.{self.input_size, input.shape.dims[1]}
);
}
// Runtime error: GPU out of memory
const output = self.backend.alloc(output_size) catch |err| {
// This can happen, return error
return err;
};
return output;
}
Memory Management
Explicit Lifetimes
// ✅ Good: Clear ownership
pub fn forward(self: *Model, input: Tensor) !Tensor {
const output = try Tensor.zeros(.{32, 10}, .f32, .cpu);
// Caller owns output, caller must free
return output;
}
pub fn example() !void {
const output = try model.forward(input);
defer output.deinit(); // Explicit cleanup
}
// ❌ Bad: Unclear ownership
pub fn forward(self: *Model, input: Tensor) !Tensor {
// Who frees this? Model? Caller?
const output = try self.allocator.create(Tensor);
// ...
}
Anti-Patterns to Avoid
Magic Numbers
// ❌ Bad
if (size > 1024) { ... }
// ✅ Good
const MAX_BATCH_SIZE = 1024;
if (size > MAX_BATCH_SIZE) { ... }
Premature Abstraction
// ❌ Bad: Over-engineered
pub const BackendFactory = struct {
pub fn create(comptime T: type) Backend(T) { ... }
};
// ✅ Good: Simple
pub fn createCpuBackend() Backend { ... }
pub fn createCudaBackend() Backend { ... }
Hidden Allocations
// ❌ Bad: Surprise allocation
pub fn concat(a: Tensor, b: Tensor) Tensor {
const result = Tensor.alloc(...); // Hidden!
// ...
}
// ✅ Good: Explicit
pub fn concat(allocator: Allocator, a: Tensor, b: Tensor) !Tensor {
const result = try Tensor.alloc(allocator, ...);
// ...
}
Ignoring Errors
// ❌ Bad
const result = riskyOperation() catch unreachable;
// ✅ Good
const result = riskyOperation() catch |err| {
log.err("Operation failed: {}", .{err});
return err;
};
Checklist for PRs
Before submitting, verify:
- All tests pass on all platforms
- Code formatted with zig fmt
- No undefined behavior (checked with assertions)
- Fixed resource limits where applicable
- Error handling is explicit
- Napkin math documented for performance code
- Benchmarks prove optimizations
- Public APIs documented
- Complex logic has comments explaining "why"
- No magic numbers
- Memory ownership is clear
Resources
Questions?
If you're unsure whether something follows Tiger Style, ask in a PR or issue. We're here to help!
Ztorch Zig 0.15.x Quick Reference: DOs and DON'Ts
Language Syntax
❌ DON'T use usingnamespace
// WRONG - Removed in 0.15.x
pub usingnamespace @import("other.zig");
✅ DO use explicit declarations or conditionals
// RIGHT
pub const foo = other.foo;
pub const bar = other.bar;
// OR for conditional inclusion:
pub const init = if (condition) initWindows else initLinux;
I/O and Printing
❌ DON'T use old generic writer API
// WRONG
const stdout = std.io.getStdOut().writer();
try stdout.print("...", .{});
✅ DO provide explicit buffer to writer
// RIGHT
var stdout_buffer: [4096]u8 = undefined;
var stdout_writer = std.fs.File.stdout().writer(&stdout_buffer);
const stdout = &stdout_writer.interface;
try stdout.print("...", .{});
try stdout.flush(); // Always flush!
Format Strings
❌ DON'T use {} for custom types with format methods
// WRONG
std.debug.print("{}", .{my_custom_type});
✅ DO use {f} to call format methods explicitly
// RIGHT
std.debug.print("{f}", .{my_custom_type});
❌ DON'T pass format string to format() method
// WRONG
pub fn format(
self: @This(),
comptime fmt: []const u8,
options: std.fmt.FormatOptions,
writer: anytype,
) !void { ... }
✅ DO use new simplified format signature
// RIGHT
pub fn format(self: @This(), writer: *std.Io.Writer) std.Io.Writer.Error!void {
try writer.print("value: {d}", .{self.value});
}
Casts
❌ DON'T pass the target type to the cast builtin
// WRONG
const counter: u64 = @intCast(u64, input);
const seconds: f64 = @floatFromInt(f64, nanoseconds);
✅ DO use @as with the dedicated cast routines
// RIGHT
const counter = @as(u64, @intCast(input));
const seconds = @as(f64, @floatFromInt(nanoseconds));
These casts are now single-argument functions in Zig 0.15.x; wrap the result with @as to bind the target type.
✅ DO remember that the result type comes from context
The cast builtins (@intCast, @floatCast, @floatFromInt, …) infer their result type from the result location, not from an argument. When no result type is available from context, wrap the call in @as (or assign to an explicitly typed variable) to name the destination type.
const src_f16: f16 = 1.5;
const dst_f32 = @as(f32, @floatCast(src_f16)); // ✅ explicit destination
const src_i16: i16 = 42;
var dst_i32: i32 = undefined;
dst_i32 = @as(i32, @intCast(src_i16)); // ✅ widening cast
Two-argument calls such as @floatCast(f32, src) or @intCast(i32, src) will no longer compile.
ArrayList
❌ DON'T expect allocator field in ArrayList
// WRONG - ArrayList doesn't store allocator anymore
var list = std.ArrayList(u8).init(allocator);
try list.append(item); // No allocator stored!
✅ DO pass allocator to each operation
// RIGHT - Unmanaged by default
var list = std.ArrayList(u8){};
try list.append(allocator, item);
try list.appendSlice(allocator, items);
list.deinit(allocator);
// OR use Managed if you really want allocator field
var list = std.array_list.Managed(u8).init(allocator);
try list.append(item);
File Operations
❌ DON'T use deprecated reader()/writer()
// WRONG
const file = try std.fs.cwd().openFile("file.txt", .{});
const reader = file.reader(); // Deprecated!
✅ DO use new buffer-aware API
// RIGHT
const file = try std.fs.cwd().openFile("file.txt", .{});
var buffer: [4096]u8 = undefined;
var file_reader = file.reader(&buffer);
const reader = &file_reader.interface;
Inline Assembly Clobbers
❌ DON'T use string clobbers
// WRONG
asm volatile ("syscall"
: [ret] "={rax}" (-> usize),
: [num] "{rax}" (number),
: "rcx", "r11" // String clobbers!
);
✅ DO use typed clobbers
// RIGHT
asm volatile ("syscall"
: [ret] "={rax}" (-> usize),
: [num] "{rax}" (number),
: .{ .rcx = true, .r11 = true } // Typed!
);
Run zig fmt to auto-upgrade this!
Compression (flate/zlib/gzip)
❌ DON'T use old compress API
// WRONG
var decompress = try std.compress.zlib.decompressor(reader);
✅ DO use new unified flate API
// RIGHT
var decompress_buffer: [std.compress.flate.max_window_len]u8 = undefined;
var decompress: std.compress.flate.Decompress = .init(reader, .zlib, &decompress_buffer);
const decompress_reader = &decompress.reader;
Data Structures
❌ DON'T use BoundedArray
// WRONG - Removed
var stack = try std.BoundedArray(i32, 8).fromSlice(items);
✅ DO use ArrayList with stack buffer
// RIGHT
var buffer: [8]i32 = undefined;
var stack = std.ArrayListUnmanaged(i32).initBuffer(&buffer);
try stack.appendSliceBounded(items);
❌ DON'T use generic LinkedList
// WRONG
var list = std.DoublyLinkedList(MyType).init();
✅ DO use intrusive list with @fieldParentPtr
// RIGHT
const MyType = struct {
node: std.DoublyLinkedList.Node = .{},
data: i32,
};
var list: std.DoublyLinkedList = .{};
// Use @fieldParentPtr("node", node_ptr) to get back to MyType
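A minimal sketch of walking such a list and recovering the containing struct (it continues the MyType example above and assumes nodes were already appended):
var it = list.first;
while (it) |node| : (it = node.next) {
    // Recover the enclosing MyType from its embedded node field.
    const item: *MyType = @fieldParentPtr("node", node);
    std.debug.print("data = {d}\n", .{item.data});
}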
Arithmetic on undefined
❌ DON'T do arithmetic on undefined
// WRONG - Compile error in 0.15.x
const a: u32 = 0;
const b: u32 = undefined;
const c = a + b; // ERROR: use of undefined causes illegal behavior
✅ DO avoid operations on undefined
// RIGHT - Don't use undefined in arithmetic
const a: u32 = 0;
const b: u32 = 0; // Use actual value or @import("std").mem.zeroes(u32)
const c = a + b;
Pointers and Casting
✅ DO use @ptrCast for single-item to slice
// NEW in 0.15.x - This now works!
const val: u32 = 1;
const bytes: []const u8 = @ptrCast(&val);
// Returns slice of same byte size as operand
✅ DO use the new numeric casts
// ints -> floats
const items: usize = 42;
const weight: f32 = @as(f32, @floatFromInt(items));
// float width changes
const precise: f64 = 3.1415926535;
const pi_approx: f32 = @floatCast(precise);
// ints -> narrower ints (panics on overflow in Debug)
const index: usize = 128;
const idx_i32: i32 = @intCast(index);
❌ DON'T rely on implicit coercion
// WRONG: no more implicit float narrowing
const precise: f64 = 3.14;
const pi: f32 = precise; // Compile error
✅ DO use @as for explicit coercion
// RIGHT: @as documents intent and stays future proof
const precise: f64 = 3.14;
const pi: f32 = @floatCast(precise);
const total: usize = @as(usize, @intCast(@max(10, 5)));
Switch on Non-Exhaustive Enums
✅ DO mix explicit tags with _ prong
// NEW in 0.15.x - This is now allowed
switch (enum_val) {
.special_case_1 => foo(),
.special_case_2 => bar(),
_, .special_case_3 => baz(), // Mix unnamed (_) with named
}
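For context, a non-exhaustive enum is declared with an explicit tag type and a trailing _ field. A minimal sketch of the kind of enum the switch above assumes (tag names are illustrative):
const Event = enum(u8) {
    special_case_1,
    special_case_2,
    special_case_3,
    _, // non-exhaustive: other tag values are allowed
};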
Build System
❌ DON'T use implicit root module fields
// WRONG - Removed in 0.15.x
const exe = b.addExecutable(.{
.name = "zros",
.root_source_file = b.path("src/main.zig"), // WRONG!
.target = target,
.optimize = optimize,
});
✅ DO use explicit root_module
// RIGHT
const exe = b.addExecutable(.{
.name = "zros",
.root_module = b.createModule(.{
.root_source_file = b.path("src/main.zig"),
.target = target,
.optimize = optimize,
}),
});
Quick Migration Checklist
- ✅ Remove all usingnamespace (use conditionals or explicit imports)
- ✅ Update all I/O code to provide explicit buffers
- ✅ Change {} to {f} for custom format methods
- ✅ Remove format strings from format() method signatures
- ✅ Update ArrayList usage (pass allocator to operations)
- ✅ Run zig fmt to auto-fix inline assembly clobbers
- ✅ Update build.zig to use root_module
- ✅ Replace BoundedArray with ArrayList + stack buffer
- ✅ Update compress/flate API usage
- ✅ Always flush() writers explicitly
For OS Development (Freestanding)
✅ These work in freestanding mode:
std.mem.* // Memory operations
std.fmt.* // Formatting (with your writer)
std.debug.* // Assertions
std.math.* // Math
std.ArrayList // Data structures
std.HashMap // Data structures
@memcpy, @memset // Compiler builtins
❌ These DON'T work in freestanding:
std.fs.* // No filesystem yet (you're building it!)
std.net.* // No network yet (you're building it!)
std.os.* // You ARE the OS
std.io.* // Use std.Io.Reader/Writer with your own backend
Kernel-Specific Tips
✅ DO use explicit buffer management
// Perfect for kernel - no hidden allocations!
var buffer: [1024]u8 = undefined;
var writer = myDevice.writer(&buffer);
const w = &writer.interface;
try w.print("Kernel message\n", .{});
try w.flush();
✅ DO use ArrayListUnmanaged for kernel data structures
// No hidden allocator field - perfect for kernel
var process_list = std.ArrayListUnmanaged(*Process){};
try process_list.append(my_allocator, new_process);
✅ DO use @memcpy and @memset (compiler builtins)
// Optimized by LLVM for your target
@memcpy(dest, src);
@memset(buffer, 0);
Auto-Fix Tools
Run these to auto-migrate:
zig fmt # Fixes inline assembly clobbers
# Manual fixes needed for everything else (sorry!)
When in Doubt
- Check compiler errors - they're usually clear in 0.15.x
- Look at lib/std/ source code for examples
- The new APIs are often simpler than old ones
- Explicit is better than implicit (Zig philosophy)
Remember: Most breaking changes make kernel development easier (explicit buffers, simpler APIs, faster compilation).
Contributing to Ztorch
Thank you for your interest in contributing to Ztorch!
Code of Conduct
Be respectful, constructive, and professional. We're building something useful together.
Development Philosophy
Ztorch follows Tiger Style development:
- Safety first
- Benchmarked performance
- Clean developer experience
- Zero technical debt
See tiger-style.md for full details.
Getting Started
Prerequisites
- Zig 0.15.x (latest)
- Git
- (Optional) CUDA Toolkit 12.0+ for GPU development
- (Optional) ROCm 5.0+ for AMD GPU development
Setup
# Clone
git clone https://github.com/mattneel/ztorch.git
cd ztorch
# Build
zig build
# Run tests
zig build test
# Run benchmarks
zig build bench
Contribution Workflow
1. Find or Create an Issue
- Check existing issues first
- For new features, discuss in an issue before implementing
- For bugs, provide minimal reproduction
2. Fork and Branch
git checkout -b feature/add-conv2d
# or
git checkout -b fix/matmul-gradient-bug
3. Implement
TDD Workflow:
- Write test first (red)
- Implement minimal code (green)
- Refactor
- Benchmark if performance-sensitive
Example: Adding a New Operation
// 1. Write test (test/ops/new_op_test.zig)
test "new_op: known result" {
    // Template placeholder: the example op squares each element.
    const input = [_]f32{ 1, 2, 3 };
    var output: [3]f32 = undefined;
    ztorch.ops.new_op_cpu(&input, &output);
    try testing.expectEqual(@as(f32, 1.0), output[0]);
}
// 2. Implement (src/ops/cpu/new_op.zig)
pub fn new_op_cpu(input: []const f32, output: []f32) void {
    std.debug.assert(input.len == output.len);
    for (input, output) |in, *out| {
        out.* = in * in; // Replace with the real operation.
    }
}
// 3. Add gradient test
test "new_op: gradient check" {
// ... numerical gradient verification
}
// 4. Benchmark
bench "new_op: 1M elements" {
// ... measure performance
}
// 5. Add to public API (src/ztorch.zig)
pub const new_op = ops.new_op_cpu;
4. Test
# Run all tests
zig build test
# Run specific test
zig build test -- test/ops/new_op_test.zig
# Check formatting
zig fmt --check .
# Fix formatting
zig fmt .
5. Benchmark
If your change affects performance:
# Run benchmarks
zig build bench
# Compare against main
git checkout main
zig build bench --save-baseline main.json
git checkout your-branch
zig build bench --compare main.json
Include benchmark results in your PR description.
6. Document
- Add docstrings to public functions
- Update README if adding features
- Update relevant docs/ files
- Add examples/ if appropriate
7. Commit
git add .
git commit -m "feat: add conv2d operation
- Implement CPU scalar version
- Add unit tests and gradient checks
- Benchmark: 2.3 GFLOPS on 32x32 kernels
- Refs #42"
Commit Message Format:
<type>: <subject>
<body>
<footer>
Types:
- feat: New feature
- fix: Bug fix
- perf: Performance improvement
- test: Adding tests
- docs: Documentation only
- refactor: Code change that neither fixes a bug nor adds a feature
- chore: Maintenance tasks
8. Push and PR
git push origin feature/add-conv2d
Create PR with:
- Clear description of what and why
- Link to related issue
- Test results
- Benchmark results (if applicable)
- Breaking changes (if any)
PR Review Process
- Automated Checks: CI must pass (all platforms)
- Code Review: Maintainer reviews code
- Discussion: Address feedback
- Approval: Maintainer approves
- Merge: Maintainer merges
Review Timeline:
- Initial response: 1-3 days
- Complete review: 1-2 weeks (depending on complexity)
What We're Looking For
High Priority
- Core operations (MatMul, Conv, Attention, etc.)
- Backend implementations (CUDA, ROCm, Vulkan)
- Performance optimizations (with benchmarks!)
- Bug fixes
- Test coverage improvements
- Documentation improvements
Medium Priority
- Additional frontends (PyTorch, TensorFlow import)
- Serialization/deserialization
- Quantization support
- More activation functions
Low Priority
- Minor refactors
- Code style changes
- Non-critical features
Guidelines
Code Style
Follow Tiger Style principles:
Do:
// Clear names
pub fn matmul_cpu_scalar(a: Tensor, b: Tensor) Tensor { ... }
// Explicit sizes
const n: u32 = 1024;
// Fixed limits
const MAX_BATCH_SIZE: usize = 1024;
for (0..@min(batch_size, MAX_BATCH_SIZE)) |i| { ... }
// Simple control flow
if (condition) {
simple_case();
} else {
other_case();
}
Don't:
// Vague names
pub fn mm(a: T, b: T) T { ... }
// Architecture-dependent sizes
const n: usize = 1024;
// Unbounded loops
while (has_more()) { ... }
// Complex nested logic
if (a) {
if (b) {
if (c) { ... }
}
}
Testing
Every PR must include:
- Unit tests for new functionality
- Gradient checks for differentiable ops (see the sketch after this list)
- Backend parity tests for GPU implementations
- Benchmarks for performance-sensitive code
Test Coverage:
- Minimum 90% line coverage
- All error paths tested
- Edge cases covered
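A minimal sketch of a numerical gradient check, using a hypothetical scalar relu and its analytic gradient against a central-difference estimate (the test points and tolerances are illustrative):
const std = @import("std");

fn relu(x: f32) f32 {
    return if (x > 0) x else 0;
}

fn reluGrad(x: f32) f32 {
    return if (x > 0) 1 else 0;
}

test "relu: gradient matches central difference" {
    const eps: f32 = 1e-3;
    const xs = [_]f32{ -2.0, -0.5, 0.5, 2.0 }; // stay away from the kink at 0
    for (xs) |x| {
        // Central difference: (f(x + eps) - f(x - eps)) / (2 * eps)
        const numeric = (relu(x + eps) - relu(x - eps)) / (2 * eps);
        try std.testing.expectApproxEqAbs(reluGrad(x), numeric, 1e-2);
    }
}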
Documentation
/// Matrix multiplication: C = A @ B
///
/// Computes the matrix product of two 2D tensors.
///
/// # Arguments
/// * `a` - Left matrix of shape (M, K)
/// * `b` - Right matrix of shape (K, N)
///
/// # Returns
/// Result matrix of shape (M, N)
///
/// # Example
/// ```
/// const a = try Tensor.randn(.{32, 64});
/// const b = try Tensor.randn(.{64, 128});
/// const c = try matmul(a, b);
/// ```
pub fn matmul(a: Tensor, b: Tensor) !Tensor {
// ...
}
Performance
- Always napkin math first (an example sketch follows this list)
- Benchmark before and after
- Document expected vs actual performance
- Prove optimizations with numbers
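As an illustration, the kind of napkin math we expect next to a hot loop, attached here to a naive matmul (the throughput numbers are assumed, not measured):
// Napkin math (illustrative): a 1024x1024x1024 matmul performs
// 2 * 1024^3 ≈ 2.15e9 FLOPs. At an assumed ~20 GFLOPS of scalar throughput
// that is roughly 107 ms per call, so a claimed 4x tiling win should land
// near 27 ms. The benchmark must confirm the estimate.
pub fn matmulNaive(a: []const f32, b: []const f32, c: []f32, n: usize) void {
    for (0..n) |i| {
        for (0..n) |j| {
            var acc: f32 = 0;
            for (0..n) |k| acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
    }
}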
Safety
- No undefined behavior
- Explicit error handling
- Fixed resource limits
- Assertions for invariants (see the sketch below)
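For example, a minimal sketch of asserting invariants at a function boundary (the function and its invariants are illustrative):
const std = @import("std");

pub fn scaleInPlace(values: []f32, factor: f32) void {
    // Invariants: callers never pass an empty slice or a non-finite factor.
    std.debug.assert(values.len > 0);
    std.debug.assert(std.math.isFinite(factor));
    for (values) |*v| v.* *= factor;
}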
Common Issues
CI Failures
"Test failed on Windows"
- Windows has different line endings
- Use \n consistently
- Run tests locally on Windows if possible
"Formatting check failed"
zig fmt .
git add .
git commit --amend --no-edit
git push --force
"Benchmark regression"
- Investigate why performance decreased
- Either fix the regression or justify it
- Update baseline if intentional
Review Feedback
"This needs tests"
- Add missing test cases
- Ensure coverage is adequate
"Can you add a benchmark?"
- Add benchmark for the new code
- Compare against baseline
"This violates Tiger Style"
- Review tiger-style.md
- Refactor accordingly
Getting Help
- Questions: Open a discussion
- Bugs: Open an issue
- Features: Open an issue for discussion first
- Chat: (Discord/Matrix link if available)
Recognition
Contributors are recognized in:
- CONTRIBUTORS.md file
- Release notes
- Git history
Significant contributions may earn you commit access.
License
By contributing, you agree that your contributions will be licensed under the same license as the project (Apache 2.0 or MIT).