Ztorch Operations Catalog

Complete specification of all operations in Ztorch.

Each operation includes:

  • Mathematical definition
  • Forward pass algorithm
  • Backward pass (gradient) algorithm
  • Implementation status
  • Test requirements
  • Performance characteristics

Matrix Operations

MatMul

Status: ✅ Reference CPU

Matrix multiplication: C = A @ B

Shapes:

  • A: (M, K)
  • B: (K, N)
  • C: (M, N)

Forward:

C[i,j] = sum_k(A[i,k] * B[k,j])

Backward:

d_A[i,k] = sum_j(d_C[i,j] * B[k,j])
d_B[k,j] = sum_i(A[i,k] * d_C[i,j])

Simplified:
d_A = d_C @ B.T
d_B = A.T @ d_C
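
A minimal CPU-scalar sketch of the forward and backward passes over row-major buffers. Function names and signatures here are illustrative, not the actual matmul_cpu_scalar API.

// C = A @ B with A: (m, k), B: (k, n), C: (m, n), stored row-major.
fn matmulForward(c: []f32, a: []const f32, b: []const f32, m: usize, k: usize, n: usize) void {
    for (0..m) |i| {
        for (0..n) |j| {
            var acc: f32 = 0.0;
            for (0..k) |p| acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
    }
}

// d_A = d_C @ B.T and d_B = A.T @ d_C, written as explicit sums.
fn matmulBackward(d_a: []f32, d_b: []f32, d_c: []const f32, a: []const f32, b: []const f32, m: usize, k: usize, n: usize) void {
    for (0..m) |i| {
        for (0..k) |p| {
            var acc: f32 = 0.0;
            for (0..n) |j| acc += d_c[i * n + j] * b[p * n + j];
            d_a[i * k + p] = acc;
        }
    }
    for (0..k) |p| {
        for (0..n) |j| {
            var acc: f32 = 0.0;
            for (0..m) |i| acc += a[i * k + p] * d_c[i * n + j];
            d_b[p * n + j] = acc;
        }
    }
}

Reordering the inner loops to i-k-j is a common cache-friendly variant and a natural starting point for the planned SIMD backend.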

FLOPs: 2 * M * K * N

Memory: (M*K + K*N + M*N) * sizeof(dtype)

Tests Required:

  • Identity matrix
  • Known result (2x2, 3x3)
  • Large matrices (1024x1024)
  • Non-square matrices
  • Gradient check

Backend Status:

  • CPU Scalar: ✅ Implemented (matmul_cpu_scalar)
  • CPU SIMD: ⏳
  • CUDA: ⏳

Benchmarks:

  • bench/ops/matmul.zig (included in zig build bench)

Transpose

Status: ✅ Reference CPU

Matrix transpose: B = A.T

Shapes:

  • A: (M, N)
  • B: (N, M)

Forward:

B[i,j] = A[j,i]

Backward:

d_A = d_B.T

Implementation Note: Can be a view (no data copy) with stride adjustment.
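
A sketch of the view approach, assuming a minimal 2-D view type with explicit strides (hypothetical; not Ztorch's actual Tensor type):

// Transposing a view: swap shape and strides; the data buffer is shared, not copied.
const View2 = struct {
    data: []f32,
    shape: [2]usize,
    strides: [2]usize, // element strides, not bytes

    fn at(self: View2, i: usize, j: usize) f32 {
        return self.data[i * self.strides[0] + j * self.strides[1]];
    }

    fn transpose(self: View2) View2 {
        return .{
            .data = self.data,
            .shape = .{ self.shape[1], self.shape[0] },
            .strides = .{ self.strides[1], self.strides[0] },
        };
    }
};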

Activations

ReLU

Status: ✅ Reference CPU

Rectified Linear Unit: y = max(0, x)

Forward:

y[i] = max(0, x[i])

Backward:

d_x[i] = d_y[i] * (x[i] > 0 ? 1 : 0)
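
A CPU-scalar sketch of both passes (the shipped kernels are relu_cpu_scalar and relu_backward_cpu_scalar; the signatures below are illustrative):

// Forward: clamp negatives to zero.
fn reluForward(y: []f32, x: []const f32) void {
    for (y, x) |*yi, xi| yi.* = @max(xi, 0.0);
}

// Backward: pass the upstream gradient only where the input was positive.
fn reluBackward(d_x: []f32, d_y: []const f32, x: []const f32) void {
    for (d_x, d_y, x) |*dxi, dyi, xi| dxi.* = if (xi > 0.0) dyi else 0.0;
}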

FLOPs: N (comparisons)

Tests Required:

  • All positive input
  • All negative input
  • Mixed positive/negative
  • Zero values
  • Gradient check

Backend Status:

  • CPU Scalar: ✅ Implemented (relu_cpu_scalar, relu_backward_cpu_scalar)
  • CPU SIMD: ⏳
  • CUDA: ⏳

Benchmarks:

  • bench/ops/activations.zig (included in zig build bench)

PTX Implementation (reference):

// y = max(0, x); 0f00000000 is the PTX hex encoding of 0.0f
max.f32 %out, %in, 0f00000000;

GELU

Status: ⏳ Planned

Gaussian Error Linear Unit (approximation):

y = 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3)))

Backward: (derivative of the tanh approximation; to be documented with the implementation)

FLOPs: ~10N (approximate)
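
Since the op is still planned, here is only a sketch of the tanh-approximation forward pass (names illustrative):

const std = @import("std");

// y = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
fn geluForward(y: []f32, x: []const f32) void {
    const c: f32 = 0.7978845608; // sqrt(2 / pi)
    for (y, x) |*yi, xi| {
        const inner = c * (xi + 0.044715 * xi * xi * xi);
        yi.* = 0.5 * xi * (1.0 + std.math.tanh(inner));
    }
}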

Softmax

Status: ✅ Reference CPU

Softmax over dimension d:

y[i] = exp(x[i] - max(x)) / sum_j(exp(x[j] - max(x)))

Why subtract max: Numerical stability (prevents overflow in exp)

Forward Algorithm:

1. max_val = max(x)
2. exp_vals[i] = exp(x[i] - max_val)
3. sum_exp = sum(exp_vals)
4. y[i] = exp_vals[i] / sum_exp
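
A CPU-scalar sketch of steps 1-4 for a single 1-D slice (the real kernel applies this along dimension d; names here are illustrative):

// Numerically stable softmax: shift by the max before exponentiating.
fn softmaxForward(y: []f32, x: []const f32) void {
    var max_val = x[0];
    for (x[1..]) |xi| max_val = @max(max_val, xi);

    var sum_exp: f32 = 0.0;
    for (y, x) |*yi, xi| {
        yi.* = @exp(xi - max_val);
        sum_exp += yi.*;
    }
    for (y) |*yi| yi.* /= sum_exp;
}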

Backward: (complex Jacobian, see implementation)

FLOPs: ~5N

Tests Required:

  • Uniform input (all same value)
  • One dominant logit (its output should be ~1.0)
  • Output sums to 1.0
  • Gradient check

Backend Status:

  • CPU Scalar: ✅ Implemented (softmax_cpu_scalar)
  • CPU SIMD: ⏳
  • CUDA: ⏳

Normalization

LayerNorm

Status: ⏳ Planned

Layer normalization:

y = (x - mean) / sqrt(var + eps) * gamma + beta

Forward Algorithm:

1. mean = sum(x) / N
2. var = sum((x - mean)^2) / N
3. x_norm = (x - mean) / sqrt(var + eps)
4. y = gamma * x_norm + beta
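
A sketch of what the forward pass could look like for one normalized vector (the op is still planned; eps and the names are illustrative):

// Steps 1-4 above for a single vector x of length N; gamma and beta match its shape.
fn layerNormForward(y: []f32, x: []const f32, gamma: []const f32, beta: []const f32, eps: f32) void {
    const n: f32 = @floatFromInt(x.len);

    var mean: f32 = 0.0;
    for (x) |xi| mean += xi;
    mean /= n;

    var variance: f32 = 0.0;
    for (x) |xi| variance += (xi - mean) * (xi - mean);
    variance /= n;

    const inv_std = 1.0 / @sqrt(variance + eps);
    for (y, x, gamma, beta) |*yi, xi, g, b| {
        yi.* = (xi - mean) * inv_std * g + b;
    }
}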

Backward: (chain rule through all operations)

FLOPs: ~5N

Parameters:

  • gamma: learned scale (shape: normalized_shape)
  • beta: learned bias (shape: normalized_shape)

Tests Required:

  • Known mean/variance
  • Learnable parameters
  • Gradient check

BatchNorm

Status: ⏳ Planned (v0.2)

Batch normalization. More complex than LayerNorm because it maintains running statistics and behaves differently in training and inference modes.

Loss Functions

CrossEntropy

Status: ✅ Reference CPU

Cross-entropy loss over class logits, stabilized with the log-sum-exp trick:

shifted = logits - max(logits)        (per row)
logsumexp = log(sum(exp(shifted)))
loss = mean(logsumexp - shifted[range, target])

Backward:

d_logits = (softmax(logits) - one_hot(target)) / batch_size

Backend Status:

  • CPU Scalar: ✅ Implemented (cross_entropy_cpu_scalar)
  • CPU SIMD: ⏳
  • CUDA: ⏳

Tests:

  • tests/ops/loss.zig

Equivalent per-sample view:

1. probs = softmax(logits)
2. loss = -log(probs[target_class])

Per-sample backward (before averaging over the batch):

d_logits[i] = probs[i] - (i == target ? 1 : 0)
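
A CPU-scalar sketch combining the stabilized forward pass and the averaged backward pass (illustrative; not the actual cross_entropy_cpu_scalar signature):

// logits is row-major (batch_size x num_classes); returns the mean loss and
// writes d_logits = (softmax(logits) - one_hot(target)) / batch_size.
fn crossEntropyForwardBackward(
    d_logits: []f32,
    logits: []const f32,
    targets: []const usize,
    batch_size: usize,
    num_classes: usize,
) f32 {
    var total_loss: f32 = 0.0;
    const inv_batch: f32 = 1.0 / @as(f32, @floatFromInt(batch_size));

    for (0..batch_size) |row| {
        const x = logits[row * num_classes ..][0..num_classes];
        const dx = d_logits[row * num_classes ..][0..num_classes];

        var max_val = x[0];
        for (x[1..]) |xi| max_val = @max(max_val, xi);

        var sum_exp: f32 = 0.0;
        for (x) |xi| sum_exp += @exp(xi - max_val);

        // loss_row = logsumexp(shifted) - shifted[target]
        total_loss += @log(sum_exp) - (x[targets[row]] - max_val);

        // d_logits_row = softmax(x) - one_hot(target), averaged over the batch.
        for (dx, x, 0..) |*d, xi, j| {
            const p = @exp(xi - max_val) / sum_exp;
            const one_hot: f32 = if (j == targets[row]) 1.0 else 0.0;
            d.* = (p - one_hot) * inv_batch;
        }
    }
    return total_loss * inv_batch;
}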

Tests Required:

  • Perfect prediction (loss ~ 0)
  • Uniform logits (loss ~ log(num_classes))
  • Gradient check

MSE

Status: ⏳ Planned

Mean squared error:

loss = mean((pred - target)^2)

Forward:

loss = sum((pred[i] - target[i])^2) / N

Backward:

d_pred[i] = 2 * (pred[i] - target[i]) / N
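
A sketch of both passes over flat buffers (the op is planned; names are illustrative):

// loss = sum((pred - target)^2) / N
fn mseForward(pred: []const f32, target: []const f32) f32 {
    var sum: f32 = 0.0;
    for (pred, target) |p, t| {
        const diff = p - t;
        sum += diff * diff;
    }
    return sum / @as(f32, @floatFromInt(pred.len));
}

// d_pred = 2 * (pred - target) / N
fn mseBackward(d_pred: []f32, pred: []const f32, target: []const f32) void {
    const inv_n: f32 = 1.0 / @as(f32, @floatFromInt(pred.len));
    for (d_pred, pred, target) |*d, p, t| d.* = 2.0 * (p - t) * inv_n;
}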

Element-wise Operations

Add

z = x + y

Shapes: Broadcasting supported

Forward: z[i] = x[i] + y[i]

Backward: d_x = d_z, d_y = d_z (summed over any broadcast dimensions)

Multiply

z = x * y

Forward: z[i] = x[i] * y[i]

Backward: d_x = d_z * y, d_y = d_z * x

Exp

y = exp(x)

Forward: y[i] = exp(x[i])

Backward: d_x[i] = d_y[i] * y[i]

Log

y = log(x)

Forward: y[i] = log(x[i])

Backward: d_x[i] = d_y[i] / x[i]
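
All four follow the same flat-loop pattern; Exp is sketched below, with its backward reusing the forward output (names illustrative):

// Forward: y[i] = exp(x[i])
fn expForward(y: []f32, x: []const f32) void {
    for (y, x) |*yi, xi| yi.* = @exp(xi);
}

// Backward: d_x = d_y * y, reusing y instead of recomputing exp(x).
fn expBackward(d_x: []f32, d_y: []const f32, y: []const f32) void {
    for (d_x, d_y, y) |*dxi, dyi, yi| dxi.* = dyi * yi;
}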

Reduction Operations

Sum

Sum over dimension(s)

Forward: y = sum(x, dim)

Backward: Broadcast d_y to shape of x

Mean

Mean over dimension(s)

Forward: y = sum(x, dim) / N

Backward: Broadcast d_y / N to shape of x

Max

Max over dimension(s)

Forward: y = max(x, dim)

Backward: Gradient flows only to max element (argmax mask)
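
A sketch of the full (all-elements) Max reduction and its argmax-mask backward; the dimension-wise variants repeat this per output element (names illustrative):

// Forward: return the maximum and record where it came from.
fn maxForward(x: []const f32, argmax: *usize) f32 {
    var best = x[0];
    argmax.* = 0;
    for (x[1..], 1..) |xi, i| {
        if (xi > best) {
            best = xi;
            argmax.* = i;
        }
    }
    return best;
}

// Backward: the upstream gradient flows only to the argmax position.
fn maxBackward(d_x: []f32, d_y: f32, argmax: usize) void {
    @memset(d_x, 0.0);
    d_x[argmax] = d_y;
}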

Testing Requirements

Every operation must have:

  1. Unit tests with known values
     test "matmul: 2x2 known result" {
         // [[1,2],[3,4]] @ [[5,6],[7,8]] = [[19,22],[43,50]]
     }
  2. Gradient checks (see the numerical-check sketch after this list)
     test "matmul: gradient check" {
         // Compare autograd gradient vs numerical gradient
     }
  3. Backend parity tests
     test "matmul: cpu vs cuda" {
         // Verify GPU output matches CPU (within epsilon)
     }
  4. Benchmarks
     bench "matmul: 1024x1024" {
         // Measure GFLOPS
     }
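
A sketch of the numerical gradient check the tests rely on, using central differences (the helper name, tolerance handling, and callback shape are illustrative):

const std = @import("std");

// Perturb each input element, measure the finite-difference slope of the scalar
// loss f, and compare it against the analytic gradient from the backward pass.
fn checkGradient(
    f: *const fn ([]const f32) f32,
    x: []f32,
    analytic_grad: []const f32,
    eps: f32,
    tol: f32,
) bool {
    for (x, analytic_grad) |*xi, g| {
        const original = xi.*;
        xi.* = original + eps;
        const loss_plus = f(x);
        xi.* = original - eps;
        const loss_minus = f(x);
        xi.* = original;

        const numeric = (loss_plus - loss_minus) / (2.0 * eps);
        if (!std.math.approxEqAbs(f32, numeric, g, tol)) return false;
    }
    return true;
}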

Optimizers

SGD

Status: ✅ Reference CPU

Stochastic Gradient Descent with constant learning rate.

Update Rule:

param -= lr * grad
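
A minimal sketch of the update over one flat parameter buffer (the shipped version lives in optim/sgd.zig; this is not its exact signature):

// In-place SGD step: param -= lr * grad, element by element.
fn sgdStep(params: []f32, grads: []const f32, lr: f32) void {
    for (params, grads) |*p, g| p.* -= lr * g;
}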

Backend Status:

  • CPU Scalar: ✅ (optim/sgd.zig)
  • CPU SIMD: ⏳
  • CUDA: ⏳

Tests:

  • tests/integration/xor.zig

Implementation Checklist

For each operation:

  • Mathematical definition documented
  • Forward pass implemented (CPU scalar)
  • Backward pass implemented (CPU scalar)
  • Unit tests with known values
  • Gradient check tests
  • Benchmarked (baseline)
  • CPU SIMD implementation
  • CUDA implementation
  • Backend parity tests
  • Performance validated (vs napkin math)

Future Operations (v0.2+)

  • Conv2D
  • MaxPool2D
  • Dropout
  • Embedding
  • Attention (fused)
  • RMSNorm