Ztorch Operations Catalog
Complete specification of all operations in Ztorch.
Each operation includes:
- Mathematical definition
- Forward pass algorithm
- Backward pass (gradient) algorithm
- Implementation status
- Test requirements
- Performance characteristics
Matrix Operations
MatMul
Status: ✅ Reference CPU
Matrix multiplication: C = A @ B
Shapes:
- A: (M, K)
- B: (K, N)
- C: (M, N)
Forward:
C[i,j] = sum_k(A[i,k] * B[k,j])
Backward:
d_A[i,k] = sum_j(d_C[i,j] * B[k,j])
d_B[k,j] = sum_i(A[i,k] * d_C[i,j])
Simplified:
d_A = d_C @ B.T
d_B = A.T @ d_C
FLOPs: 2 * M * K * N
Memory: (M*K + K*N + M*N) * sizeof(dtype)
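The reference kernel is a straightforward triple loop. Below is a minimal sketch in Zig of the forward and backward formulas above, assuming row-major f32 slices; the function names and signatures are illustrative, not the actual matmul_cpu_scalar API.

// Hypothetical signatures; row-major layout assumed.
// C[i,j] = sum_k A[i,k] * B[k,j]
fn matmulForward(a: []const f32, b: []const f32, c: []f32, m: usize, k: usize, n: usize) void {
    for (0..m) |i| {
        for (0..n) |j| {
            var acc: f32 = 0.0;
            for (0..k) |p| acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
    }
}

// d_A = d_C @ B.T, d_B = A.T @ d_C
fn matmulBackward(a: []const f32, b: []const f32, d_c: []const f32, d_a: []f32, d_b: []f32, m: usize, k: usize, n: usize) void {
    @memset(d_a, 0.0);
    @memset(d_b, 0.0);
    for (0..m) |i| {
        for (0..n) |j| {
            const g = d_c[i * n + j];
            for (0..k) |p| {
                d_a[i * k + p] += g * b[p * n + j]; // row i of d_C against row p of B
                d_b[p * n + j] += a[i * k + p] * g; // column p of A against row i of d_C
            }
        }
    }
}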
Tests Required:
- Identity matrix
- Known result (2x2, 3x3)
- Large matrices (1024x1024)
- Non-square matrices
- Gradient check
Backend Status:
- CPU Scalar: ✅ Implemented (matmul_cpu_scalar)
- CPU SIMD: ⏳
- CUDA: ⏳
Benchmarks:
bench/ops/matmul.zig (included in zig build bench)
Transpose
Status: ✅ Reference CPU
Matrix transpose: B = A.T
Shapes:
- A: (M, N)
- B: (N, M)
Forward:
B[i,j] = A[j,i]
Backward:
d_A = d_B.T
Implementation Note: Can be a view (no data copy) with stride adjustment.
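One way to realize the view-based transpose is to swap shape and strides and leave the data buffer untouched. A sketch with a hypothetical View2D type (Ztorch's real tensor/view type will differ):

// Hypothetical 2-D view type for illustration only.
const View2D = struct {
    data: []const f32,
    rows: usize,
    cols: usize,
    row_stride: usize, // elements to skip to move one row down
    col_stride: usize, // elements to skip to move one column right

    fn at(self: View2D, i: usize, j: usize) f32 {
        return self.data[i * self.row_stride + j * self.col_stride];
    }

    // Transpose without copying: swap shape and strides.
    fn transpose(self: View2D) View2D {
        return .{
            .data = self.data,
            .rows = self.cols,
            .cols = self.rows,
            .row_stride = self.col_stride,
            .col_stride = self.row_stride,
        };
    }
};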
Activations
ReLU
Status: ✅ Reference CPU
Rectified Linear Unit: y = max(0, x)
Forward:
y[i] = max(0, x[i])
Backward:
d_x[i] = d_y[i] * (x[i] > 0 ? 1 : 0)
FLOPs: N (comparisons)
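A minimal sketch of the forward and backward passes in Zig, assuming flat f32 slices; the names are illustrative, not the actual relu_cpu_scalar / relu_backward_cpu_scalar signatures.

// y[i] = max(0, x[i])
fn reluForward(x: []const f32, y: []f32) void {
    for (x, y) |xi, *yi| yi.* = @max(xi, 0.0);
}

// d_x[i] = d_y[i] if x[i] > 0, else 0
fn reluBackward(x: []const f32, d_y: []const f32, d_x: []f32) void {
    for (x, d_y, d_x) |xi, gi, *di| di.* = if (xi > 0.0) gi else 0.0;
}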
Tests Required:
- All positive input
- All negative input
- Mixed positive/negative
- Zero values
- Gradient check
Backend Status:
- CPU Scalar: ✅ Implemented (relu_cpu_scalar, relu_backward_cpu_scalar)
- CPU SIMD: ⏳
- CUDA: ⏳
Benchmarks:
bench/ops/activations.zig (included in zig build bench)
PTX Implementation (reference):
// y = max(0, x)
max.f32 %out, %in, 0.0
GELU
Status: ⏳ Planned
Gaussian Error Linear Unit (approximation):
y = 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3)))
Backward: (complex, see implementation)
FLOPs: ~10N (approximate)
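Since the op is still planned, here is a hedged sketch of the tanh approximation above as a scalar loop over flat f32 slices (forward only; names hypothetical).

const std = @import("std");

// y = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
fn geluForward(x: []const f32, y: []f32) void {
    const c: f32 = @sqrt(2.0 / @as(f32, std.math.pi)); // sqrt(2/pi)
    for (x, y) |xi, *yi| {
        const inner = c * (xi + 0.044715 * xi * xi * xi);
        yi.* = 0.5 * xi * (1.0 + std.math.tanh(inner));
    }
}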
Softmax
Status: ✅ Reference CPU
Softmax over dimension d:
y[i] = exp(x[i] - max(x)) / sum_j(exp(x[j] - max(x)))
Why subtract max: Numerical stability (prevent overflow)
Forward Algorithm:
1. max_val = max(x)
2. exp_vals[i] = exp(x[i] - max_val)
3. sum_exp = sum(exp_vals)
4. y[i] = exp_vals[i] / sum_exp
Backward: (complex Jacobian, see implementation)
FLOPs: ~5N
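A sketch of the stable forward pass and the standard Jacobian-vector product for the backward pass, assuming a 1-D f32 slice; the in-tree softmax_cpu_scalar may organize this differently.

// Numerically stable softmax over a 1-D slice.
fn softmaxForward(x: []const f32, y: []f32) void {
    var max_val: f32 = x[0];
    for (x[1..]) |xi| max_val = @max(max_val, xi);

    var sum_exp: f32 = 0.0;
    for (x, y) |xi, *yi| {
        yi.* = @exp(xi - max_val);
        sum_exp += yi.*;
    }
    for (y) |*yi| yi.* /= sum_exp;
}

// Jacobian-vector product: d_x[i] = y[i] * (d_y[i] - sum_j d_y[j] * y[j])
fn softmaxBackward(y: []const f32, d_y: []const f32, d_x: []f32) void {
    var dot: f32 = 0.0;
    for (y, d_y) |yi, gi| dot += yi * gi;
    for (y, d_y, d_x) |yi, gi, *di| di.* = yi * (gi - dot);
}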
Tests Required:
- Uniform input (all same value)
- One large value (should be ~1.0)
- Output sums to 1.0
- Gradient check
Backend Status:
- CPU Scalar: ✅ Implemented (softmax_cpu_scalar)
- CPU SIMD: ⏳
- CUDA: ⏳
Normalization
LayerNorm
Status: ⏳ Planned
Layer normalization:
y = (x - mean) / sqrt(var + eps) * gamma + beta
Forward Algorithm:
1. mean = sum(x) / N
2. var = sum((x - mean)^2) / N
3. x_norm = (x - mean) / sqrt(var + eps)
4. y = gamma * x_norm + beta
Backward: (chain rule through all operations)
FLOPs: ~5N
Parameters:
- gamma: learned scale (shape: normalized_shape)
- beta: learned bias (shape: normalized_shape)
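Since the op is still planned, here is a hedged sketch of the forward algorithm above over a single normalized vector (f32 slices; names and signature are hypothetical).

// Forward only: normalize x, then scale/shift with gamma and beta.
fn layerNormForward(x: []const f32, gamma: []const f32, beta: []const f32, y: []f32, eps: f32) void {
    const n: f32 = @floatFromInt(x.len);

    var mean: f32 = 0.0;
    for (x) |xi| mean += xi;
    mean /= n;

    var variance: f32 = 0.0;
    for (x) |xi| variance += (xi - mean) * (xi - mean);
    variance /= n;

    const inv_std = 1.0 / @sqrt(variance + eps);
    for (x, gamma, beta, y) |xi, g, b, *yi| {
        yi.* = (xi - mean) * inv_std * g + b;
    }
}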
Tests Required:
- Known mean/variance
- Learnable parameters
- Gradient check
BatchNorm
Status: ⏳ Planned (v0.2)
Batch normalization (more complex because training and inference modes behave differently)
Loss Functions
CrossEntropy
Status: ✅ Reference CPU
Cross-entropy loss over class logits with numerical stabilization:
logsumexp = max(logits) + log(sum(exp(logits - max(logits))))
loss = mean(logsumexp - logits[range, target])
Why shift by max: Numerical stability (prevent overflow in exp); the loss is equivalent to mean(-log(softmax(logits)[target])).
Forward Algorithm:
1. Compute the stabilized logsumexp of the logits
2. loss = mean(logsumexp - logits[range, target])
Backward:
d_logits = (softmax(logits) - one_hot(target)) / batch_size
Per sample (before the 1/batch_size factor): d_logits[i] = probs[i] - (i == target ? 1 : 0)
Tests Required:
- Perfect prediction (loss ~ 0)
- Random prediction (loss ~ log(num_classes))
- Gradient check
Backend Status:
- CPU Scalar: ✅ Implemented (cross_entropy_cpu_scalar)
- CPU SIMD: ⏳
- CUDA: ⏳
Tests:
tests/ops/loss.zig
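A single-sample sketch of the stabilized forward pass and the per-sample gradient, assuming raw f32 logits; the in-tree cross_entropy_cpu_scalar additionally averages over the batch per the definition above, and these names are illustrative.

// loss = logsumexp(logits) - logits[target], computed stably.
fn crossEntropyForward(logits: []const f32, target: usize) f32 {
    var max_val: f32 = logits[0];
    for (logits[1..]) |l| max_val = @max(max_val, l);

    var sum_exp: f32 = 0.0;
    for (logits) |l| sum_exp += @exp(l - max_val);

    const logsumexp = max_val + @log(sum_exp);
    return logsumexp - logits[target];
}

// d_logits[i] = softmax(logits)[i] - (i == target), before the 1/batch_size factor.
fn crossEntropyBackward(logits: []const f32, target: usize, d_logits: []f32) void {
    var max_val: f32 = logits[0];
    for (logits[1..]) |l| max_val = @max(max_val, l);

    var sum_exp: f32 = 0.0;
    for (logits, d_logits) |l, *d| {
        d.* = @exp(l - max_val);
        sum_exp += d.*;
    }
    for (d_logits, 0..) |*d, i| {
        d.* = d.* / sum_exp - (if (i == target) @as(f32, 1.0) else 0.0);
    }
}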
MSE
Status: ⏳ Planned
Mean squared error:
loss = mean((pred - target)^2)
Forward:
loss = sum((pred[i] - target[i])^2) / N
Backward:
d_pred[i] = 2 * (pred[i] - target[i]) / N
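A minimal sketch matching the formulas above (f32 slices; hypothetical names, since the op is still planned).

// loss = mean((pred - target)^2)
fn mseForward(pred: []const f32, target: []const f32) f32 {
    var sum: f32 = 0.0;
    for (pred, target) |p, t| sum += (p - t) * (p - t);
    return sum / @as(f32, @floatFromInt(pred.len));
}

// d_pred[i] = 2 * (pred[i] - target[i]) / N
fn mseBackward(pred: []const f32, target: []const f32, d_pred: []f32) void {
    const n: f32 = @floatFromInt(pred.len);
    for (pred, target, d_pred) |p, t, *d| d.* = 2.0 * (p - t) / n;
}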
Element-wise Operations
Add
z = x + y
Shapes: Broadcasting supported
Forward: z[i] = x[i] + y[i]
Backward: d_x = d_z, d_y = d_z (summed over broadcast dimensions when the input shapes differ)
Multiply
z = x * y
Forward: z[i] = x[i] * y[i]
Backward: d_x = d_z * y, d_y = d_z * x
Exp
y = exp(x)
Forward: y[i] = exp(x[i])
Backward: d_x[i] = d_y[i] * y[i]
Log
y = log(x)
Forward: y[i] = log(x[i])
Backward: d_x[i] = d_y[i] / x[i]
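The element-wise ops all follow the same pattern; Multiply and Exp are sketched below as representatives (Multiply's backward needs the saved inputs, while Exp's reuses the saved output). Names and flat-slice signatures are illustrative, and gradients are written rather than accumulated.

// Multiply: z = x * y; product rule uses both saved inputs.
fn mulForward(x: []const f32, y: []const f32, z: []f32) void {
    for (x, y, z) |xi, yi, *zi| zi.* = xi * yi;
}

fn mulBackward(x: []const f32, y: []const f32, d_z: []const f32, d_x: []f32, d_y: []f32) void {
    for (x, y, d_z, d_x, d_y) |xi, yi, gi, *dxi, *dyi| {
        dxi.* = gi * yi;
        dyi.* = gi * xi;
    }
}

// Exp: y = exp(x); backward reuses the saved forward output.
fn expForward(x: []const f32, y: []f32) void {
    for (x, y) |xi, *yi| yi.* = @exp(xi);
}

fn expBackward(y: []const f32, d_y: []const f32, d_x: []f32) void {
    for (y, d_y, d_x) |yi, gi, *di| di.* = gi * yi;
}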
Reduction Operations
Sum
Sum over dimension(s)
Forward: y = sum(x, dim)
Backward: Broadcast d_y to shape of x
Mean
Mean over dimension(s)
Forward: y = sum(x, dim) / N
Backward: Broadcast d_y / N to shape of x
Max
Max over dimension(s)
Forward: y = max(x, dim)
Backward: Gradient flows only to max element (argmax mask)
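A sketch of full 1-D reductions illustrating the backward rules above (broadcast for Sum, argmax mask for Max); dimension arguments and keepdim handling are omitted, and the names are hypothetical.

// Full reduction over a 1-D slice.
fn sumForward(x: []const f32) f32 {
    var s: f32 = 0.0;
    for (x) |xi| s += xi;
    return s;
}

// Backward of sum: every input position receives the upstream gradient.
fn sumBackward(d_y: f32, d_x: []f32) void {
    for (d_x) |*di| di.* = d_y;
}

// Backward of max: gradient flows only to the (first) argmax element.
fn maxBackward(x: []const f32, d_y: f32, d_x: []f32) void {
    var argmax: usize = 0;
    for (x, 0..) |xi, i| {
        if (xi > x[argmax]) argmax = i;
    }
    for (d_x) |*di| di.* = 0.0;
    d_x[argmax] = d_y;
}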
Testing Requirements
Every operation must have:
- Unit tests with known values
test "matmul: 2x2 known result" {
// [[1,2],[3,4]] @ [[5,6],[7,8]] = [[19,22],[43,50]]
}
- Gradient checks
test "matmul: gradient check" {
// Compare autograd gradient vs numerical gradient (see the helper sketch after this list)
}
- Backend parity tests
test "matmul: cpu vs cuda" {
// Verify GPU output matches CPU (within epsilon)
}
- Benchmarks
bench "matmul: 1024x1024" {
// Measure GFLOPS
}
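The gradient-check requirement can be met with a central-difference helper such as the following sketch; numericalGradient and the test below are illustrative, not part of Ztorch's test utilities.

const std = @import("std");

// Central-difference estimate of the gradient of a scalar-valued f at x.
fn numericalGradient(comptime f: fn ([]const f32) f32, x: []f32, grad_out: []f32, h: f32) void {
    for (x, grad_out) |*xi, *gi| {
        const orig = xi.*;
        xi.* = orig + h;
        const plus = f(x);
        xi.* = orig - h;
        const minus = f(x);
        xi.* = orig;
        gi.* = (plus - minus) / (2.0 * h);
    }
}

test "gradient check: sum of squares (standalone example)" {
    const f = struct {
        fn call(xs: []const f32) f32 {
            var s: f32 = 0.0;
            for (xs) |v| s += v * v;
            return s;
        }
    }.call;

    var x = [_]f32{ 0.5, -1.25, 2.0 };
    var num = [_]f32{ 0, 0, 0 };
    numericalGradient(f, &x, &num, 1e-2);
    // Analytic gradient of sum(x^2) is 2x.
    for (x, num) |xi, ni| try std.testing.expectApproxEqAbs(2.0 * xi, ni, 1e-3);
}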
Optimizers
SGD
Status: ✅ Reference CPU
Stochastic Gradient Descent with constant learning rate.
Update Rule:
param -= lr * grad
Backend Status:
- CPU Scalar: ✅ (optim/sgd.zig)
- CPU SIMD: ⏳
- CUDA: ⏳
Tests:
tests/integration/xor.zig
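The update rule reduces to one element-wise loop per parameter tensor. A minimal sketch assuming flat f32 slices; how optim/sgd.zig iterates over parameters is not shown here.

// param -= lr * grad
fn sgdStep(params: []f32, grads: []const f32, lr: f32) void {
    for (params, grads) |*p, g| p.* -= lr * g;
}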
Implementation Checklist
For each operation:
- Mathematical definition documented
- Forward pass implemented (CPU scalar)
- Backward pass implemented (CPU scalar)
- Unit tests with known values
- Gradient check tests
- Benchmarked (baseline)
- CPU SIMD implementation
- CUDA implementation
- Backend parity tests
- Performance validated (vs napkin math)
Future Operations (v0.2+)
- Conv2D
- MaxPool2D
- Dropout
- Embedding
- Attention (fused)
- RMSNorm