Ztorch Operations Catalog
Complete specification of all operations in Ztorch.
Each operation includes:
- Mathematical definition
- Forward pass algorithm
- Backward pass (gradient) algorithm
- Implementation status
- Test requirements
- Performance characteristics
Matrix Operations
MatMul
Status: ✅ Reference CPU
Matrix multiplication: C = A @ B
Shapes:
- A: (M, K)
- B: (K, N)
- C: (M, N)
Forward:
C[i,j] = sum_k(A[i,k] * B[k,j])
Backward:
d_A[i,k] = sum_j(d_C[i,j] * B[k,j])
d_B[k,j] = sum_i(A[i,k] * d_C[i,j])
Simplified:
d_A = d_C @ B.T
d_B = A.T @ d_C
FLOPs: 2 * M * K * N
Memory: (M*K + K*N + M*N) * sizeof(dtype)
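The reference kernel is a straightforward triple loop. Below is a minimal sketch in Zig of the forward and backward formulas above, assuming row-major f32 slices; the function names and signatures are illustrative, not the actual matmul_cpu_scalar API.

// Hypothetical signatures; row-major layout assumed.
// C[i,j] = sum_k A[i,k] * B[k,j]
fn matmulForward(a: []const f32, b: []const f32, c: []f32, m: usize, k: usize, n: usize) void {
    for (0..m) |i| {
        for (0..n) |j| {
            var acc: f32 = 0.0;
            for (0..k) |p| acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
    }
}

// d_A = d_C @ B.T, d_B = A.T @ d_C
fn matmulBackward(a: []const f32, b: []const f32, d_c: []const f32, d_a: []f32, d_b: []f32, m: usize, k: usize, n: usize) void {
    @memset(d_a, 0.0);
    @memset(d_b, 0.0);
    for (0..m) |i| {
        for (0..n) |j| {
            const g = d_c[i * n + j];
            for (0..k) |p| {
                d_a[i * k + p] += g * b[p * n + j]; // row i of d_C against row p of B
                d_b[p * n + j] += a[i * k + p] * g; // column p of A against row i of d_C
            }
        }
    }
}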
Tests Required:
- Identity matrix
- Known result (2x2, 3x3)
- Large matrices (1024x1024)
- Non-square matrices
- Gradient check
Backend Status:
- CPU Scalar: ✅ Implemented (matmul_cpu_scalar)
- CPU SIMD: ⏳
- CUDA: ⏳
Benchmarks:
bench/ops/matmul.zig (included in zig build bench)
Transpose
Status: ✅ Reference CPU
Matrix transpose: B = A.T
Shapes:
- A: (M, N)
- B: (N, M)
Forward:
B[i,j] = A[j,i]
Backward:
d_A = d_B.T
Implementation Note: Can be a view (no data copy) with stride adjustment.
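One way to realize the view-based transpose is to swap shape and strides and leave the data buffer untouched. A sketch with a hypothetical View2D type (Ztorch's real tensor/view type will differ):

// Hypothetical 2-D view type for illustration only.
const View2D = struct {
    data: []const f32,
    rows: usize,
    cols: usize,
    row_stride: usize, // elements to skip to move one row down
    col_stride: usize, // elements to skip to move one column right

    fn at(self: View2D, i: usize, j: usize) f32 {
        return self.data[i * self.row_stride + j * self.col_stride];
    }

    // Transpose without copying: swap shape and strides.
    fn transpose(self: View2D) View2D {
        return .{
            .data = self.data,
            .rows = self.cols,
            .cols = self.rows,
            .row_stride = self.col_stride,
            .col_stride = self.row_stride,
        };
    }
};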
Activations
ReLU
Status: ✅ Reference CPU
Rectified Linear Unit: y = max(0, x)
Forward:
y[i] = max(0, x[i])
Backward:
d_x[i] = d_y[i] * (x[i] > 0 ? 1 : 0)
FLOPs: N (comparisons)
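A minimal sketch of the forward and backward passes in Zig, assuming flat f32 slices; the names are illustrative, not the actual relu_cpu_scalar / relu_backward_cpu_scalar signatures.

// y[i] = max(0, x[i])
fn reluForward(x: []const f32, y: []f32) void {
    for (x, y) |xi, *yi| yi.* = @max(xi, 0.0);
}

// d_x[i] = d_y[i] if x[i] > 0, else 0
fn reluBackward(x: []const f32, d_y: []const f32, d_x: []f32) void {
    for (x, d_y, d_x) |xi, gi, *di| di.* = if (xi > 0.0) gi else 0.0;
}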
Tests Required:
- All positive input
- All negative input
- Mixed positive/negative
- Zero values
- Gradient check
Backend Status:
- CPU Scalar: ✅ Implemented (relu_cpu_scalar, relu_backward_cpu_scalar)
- CPU SIMD: ⏳
- CUDA: ⏳
Benchmarks:
bench/ops/activations.zig (included in zig build bench)
PTX Implementation (reference):
// y = max(0, x)
max.f32 %out, %in, 0.0
GELU
Status: ⏳ Planned
Gaussian Error Linear Unit (approximation):
y = 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3)))
Backward: (complex, see implementation)
FLOPs: ~10N (approximate)
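Since the op is still planned, here is a hedged sketch of the tanh approximation above as a scalar loop over flat f32 slices (forward only; names hypothetical).

const std = @import("std");

// y = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
fn geluForward(x: []const f32, y: []f32) void {
    const c: f32 = @sqrt(2.0 / @as(f32, std.math.pi)); // sqrt(2/pi)
    for (x, y) |xi, *yi| {
        const inner = c * (xi + 0.044715 * xi * xi * xi);
        yi.* = 0.5 * xi * (1.0 + std.math.tanh(inner));
    }
}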
Softmax
Status: ✅ Reference CPU
Softmax over dimension d:
y[i] = exp(x[i] - max(x)) / sum_j(exp(x[j] - max(x)))
Why subtract max: Numerical stability (prevent overflow)
Forward Algorithm:
1. max_val = max(x)
2. exp_vals[i] = exp(x[i] - max_val)
3. sum_exp = sum(exp_vals)
4. y[i] = exp_vals[i] / sum_exp
Backward: (complex Jacobian, see implementation)
FLOPs: ~5N
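A sketch of the stable forward pass and the standard Jacobian-vector product for the backward pass, assuming a 1-D f32 slice; the in-tree softmax_cpu_scalar may organize this differently.

// Numerically stable softmax over a 1-D slice.
fn softmaxForward(x: []const f32, y: []f32) void {
    var max_val: f32 = x[0];
    for (x[1..]) |xi| max_val = @max(max_val, xi);

    var sum_exp: f32 = 0.0;
    for (x, y) |xi, *yi| {
        yi.* = @exp(xi - max_val);
        sum_exp += yi.*;
    }
    for (y) |*yi| yi.* /= sum_exp;
}

// Jacobian-vector product: d_x[i] = y[i] * (d_y[i] - sum_j d_y[j] * y[j])
fn softmaxBackward(y: []const f32, d_y: []const f32, d_x: []f32) void {
    var dot: f32 = 0.0;
    for (y, d_y) |yi, gi| dot += yi * gi;
    for (y, d_y, d_x) |yi, gi, *di| di.* = yi * (gi - dot);
}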
Tests Required:
- Uniform input (all same value)
- One large value (should be ~1.0)
- Output sums to 1.0
- Gradient check
Backend Status:
- CPU Scalar: ✅ Implemented (softmax_cpu_scalar)
- CPU SIMD: ⏳
- CUDA: ⏳
Normalization
LayerNorm
Status: ⏳ Planned
Layer normalization:
y = (x - mean) / sqrt(var + eps) * gamma + beta
Forward Algorithm:
1. mean = sum(x) / N
2. var = sum((x - mean)^2) / N
3. x_norm = (x - mean) / sqrt(var + eps)
4. y = gamma * x_norm + beta
Backward: (chain rule through all operations)
FLOPs: ~5N
Parameters:
- gamma: learned scale (shape: normalized_shape)
- beta: learned bias (shape: normalized_shape)
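Since the op is still planned, here is a hedged sketch of the forward algorithm above over a single normalized vector (f32 slices; names and signature are hypothetical).

// Forward only: normalize x, then scale/shift with gamma and beta.
fn layerNormForward(x: []const f32, gamma: []const f32, beta: []const f32, y: []f32, eps: f32) void {
    const n: f32 = @floatFromInt(x.len);

    var mean: f32 = 0.0;
    for (x) |xi| mean += xi;
    mean /= n;

    var variance: f32 = 0.0;
    for (x) |xi| variance += (xi - mean) * (xi - mean);
    variance /= n;

    const inv_std = 1.0 / @sqrt(variance + eps);
    for (x, gamma, beta, y) |xi, g, b, *yi| {
        yi.* = (xi - mean) * inv_std * g + b;
    }
}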
Tests Required:
- Known mean/variance
- Learnable parameters
- Gradient check
BatchNorm
Status: ⏳ Planned (v0.2)
Batch normalization (more complex because training and inference modes behave differently)
Loss Functions
CrossEntropy
Status: ✅ Reference CPU
Cross-entropy loss over class logits with numerical stabilization:
logsumexp = max(logits) + log(sum(exp(logits - max(logits))))
loss = mean(logsumexp - logits[range, target])
Why shift by max: Numerical stability (prevent overflow in exp); the loss is equivalent to mean(-log(softmax(logits)[target])).
Forward Algorithm:
1. Compute the stabilized logsumexp of the logits
2. loss = mean(logsumexp - logits[range, target])
Backward:
d_logits = (softmax(logits) - one_hot(target)) / batch_size
Per sample (before the 1/batch_size factor): d_logits[i] = probs[i] - (i == target ? 1 : 0)
Tests Required:
- Perfect prediction (loss ~ 0)
- Random prediction (loss ~ log(num_classes))
- Gradient check
Backend Status:
- CPU Scalar: ✅ Implemented (cross_entropy_cpu_scalar)
- CPU SIMD: ⏳
- CUDA: ⏳
Tests:
tests/ops/loss.zig
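A single-sample sketch of the stabilized forward pass and the per-sample gradient, assuming raw f32 logits; the in-tree cross_entropy_cpu_scalar additionally averages over the batch per the definition above, and these names are illustrative.

// loss = logsumexp(logits) - logits[target], computed stably.
fn crossEntropyForward(logits: []const f32, target: usize) f32 {
    var max_val: f32 = logits[0];
    for (logits[1..]) |l| max_val = @max(max_val, l);

    var sum_exp: f32 = 0.0;
    for (logits) |l| sum_exp += @exp(l - max_val);

    const logsumexp = max_val + @log(sum_exp);
    return logsumexp - logits[target];
}

// d_logits[i] = softmax(logits)[i] - (i == target), before the 1/batch_size factor.
fn crossEntropyBackward(logits: []const f32, target: usize, d_logits: []f32) void {
    var max_val: f32 = logits[0];
    for (logits[1..]) |l| max_val = @max(max_val, l);

    var sum_exp: f32 = 0.0;
    for (logits, d_logits) |l, *d| {
        d.* = @exp(l - max_val);
        sum_exp += d.*;
    }
    for (d_logits, 0..) |*d, i| {
        d.* = d.* / sum_exp - (if (i == target) @as(f32, 1.0) else 0.0);
    }
}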
MSE
Status: ⏳ Planned
Mean squared error:
loss = mean((pred - target)^2)
Forward:
loss = sum((pred[i] - target[i])^2) / N
Backward:
d_pred[i] = 2 * (pred[i] - target[i]) / N
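A minimal sketch matching the formulas above (f32 slices; hypothetical names, since the op is still planned).

// loss = mean((pred - target)^2)
fn mseForward(pred: []const f32, target: []const f32) f32 {
    var sum: f32 = 0.0;
    for (pred, target) |p, t| sum += (p - t) * (p - t);
    return sum / @as(f32, @floatFromInt(pred.len));
}

// d_pred[i] = 2 * (pred[i] - target[i]) / N
fn mseBackward(pred: []const f32, target: []const f32, d_pred: []f32) void {
    const n: f32 = @floatFromInt(pred.len);
    for (pred, target, d_pred) |p, t, *d| d.* = 2.0 * (p - t) / n;
}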
Element-wise Operations
Add
z = x + y
Shapes: Broadcasting supported
Forward: z[i] = x[i] + y[i]
Backward: d_x = d_z, d_y = d_z (summed over broadcast dimensions when the input shapes differ)
Multiply
z = x * y
Forward: z[i] = x[i] * y[i]
Backward: d_x = d_z * y, d_y = d_z * x
Exp
y = exp(x)
Forward: y[i] = exp(x[i])
Backward: d_x[i] = d_y[i] * y[i]
Log
y = log(x)
Forward: y[i] = log(x[i])
Backward: d_x[i] = d_y[i] / x[i]
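The element-wise ops all follow the same pattern; Multiply and Exp are sketched below as representatives (Multiply's backward needs the saved inputs, while Exp's reuses the saved output). Names and flat-slice signatures are illustrative, and gradients are written rather than accumulated.

// Multiply: z = x * y; product rule uses both saved inputs.
fn mulForward(x: []const f32, y: []const f32, z: []f32) void {
    for (x, y, z) |xi, yi, *zi| zi.* = xi * yi;
}

fn mulBackward(x: []const f32, y: []const f32, d_z: []const f32, d_x: []f32, d_y: []f32) void {
    for (x, y, d_z, d_x, d_y) |xi, yi, gi, *dxi, *dyi| {
        dxi.* = gi * yi;
        dyi.* = gi * xi;
    }
}

// Exp: y = exp(x); backward reuses the saved forward output.
fn expForward(x: []const f32, y: []f32) void {
    for (x, y) |xi, *yi| yi.* = @exp(xi);
}

fn expBackward(y: []const f32, d_y: []const f32, d_x: []f32) void {
    for (y, d_y, d_x) |yi, gi, *di| di.* = gi * yi;
}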
Reduction Operations
Sum
Sum over dimension(s)
Forward: y = sum(x, dim)
Backward: Broadcast d_y to shape of x
Mean
Mean over dimension(s)
Forward: y = sum(x, dim) / N
Backward: Broadcast d_y / N to shape of x
Max
Max over dimension(s)
Forward: y = max(x, dim)
Backward: Gradient flows only to max element (argmax mask)
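A sketch of full 1-D reductions illustrating the backward rules above (broadcast for Sum, argmax mask for Max); dimension arguments and keepdim handling are omitted, and the names are hypothetical.

// Full reduction over a 1-D slice.
fn sumForward(x: []const f32) f32 {
    var s: f32 = 0.0;
    for (x) |xi| s += xi;
    return s;
}

// Backward of sum: every input position receives the upstream gradient.
fn sumBackward(d_y: f32, d_x: []f32) void {
    for (d_x) |*di| di.* = d_y;
}

// Backward of max: gradient flows only to the (first) argmax element.
fn maxBackward(x: []const f32, d_y: f32, d_x: []f32) void {
    var argmax: usize = 0;
    for (x, 0..) |xi, i| {
        if (xi > x[argmax]) argmax = i;
    }
    for (d_x) |*di| di.* = 0.0;
    d_x[argmax] = d_y;
}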
Testing Requirements
Every operation must have:
- Unit tests with known values
test "matmul: 2x2 known result" {
// [[1,2],[3,4]] @ [[5,6],[7,8]] = [[19,22],[43,50]]
}
- Gradient checks
test "matmul: gradient check" {
// Compare autograd gradient vs numerical gradient (see the helper sketch after this list)
}
- Backend parity tests
test "matmul: cpu vs cuda" {
// Verify GPU output matches CPU (within epsilon)
}
- Benchmarks
bench "matmul: 1024x1024" {
// Measure GFLOPS
}
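The gradient-check requirement can be met with a central-difference helper such as the following sketch; numericalGradient and the test below are illustrative, not part of Ztorch's test utilities.

const std = @import("std");

// Central-difference estimate of the gradient of a scalar-valued f at x.
fn numericalGradient(comptime f: fn ([]const f32) f32, x: []f32, grad_out: []f32, h: f32) void {
    for (x, grad_out) |*xi, *gi| {
        const orig = xi.*;
        xi.* = orig + h;
        const plus = f(x);
        xi.* = orig - h;
        const minus = f(x);
        xi.* = orig;
        gi.* = (plus - minus) / (2.0 * h);
    }
}

test "gradient check: sum of squares (standalone example)" {
    const f = struct {
        fn call(xs: []const f32) f32 {
            var s: f32 = 0.0;
            for (xs) |v| s += v * v;
            return s;
        }
    }.call;

    var x = [_]f32{ 0.5, -1.25, 2.0 };
    var num = [_]f32{ 0, 0, 0 };
    numericalGradient(f, &x, &num, 1e-2);
    // Analytic gradient of sum(x^2) is 2x.
    for (x, num) |xi, ni| try std.testing.expectApproxEqAbs(2.0 * xi, ni, 1e-3);
}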
Optimizers
SGD
Status: ✅ Reference CPU
Stochastic Gradient Descent with constant learning rate.
Update Rule:
param -= lr * grad
Backend Status:
- CPU Scalar: ✅ (optim/sgd.zig)
- CPU SIMD: ⏳
- CUDA: ⏳
Tests:
tests/integration/xor.zig
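The update rule reduces to one element-wise loop per parameter tensor. A minimal sketch assuming flat f32 slices; how optim/sgd.zig iterates over parameters is not shown here.

// param -= lr * grad
fn sgdStep(params: []f32, grads: []const f32, lr: f32) void {
    for (params, grads) |*p, g| p.* -= lr * g;
}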
Implementation Checklist
For each operation:
- Mathematical definition documented
- Forward pass implemented (CPU scalar)
- Backward pass implemented (CPU scalar)
- Unit tests with known values
- Gradient check tests
- Benchmarked (baseline)
- CPU SIMD implementation
- CUDA implementation
- Backend parity tests
- Performance validated (vs napkin math)
Future Operations (v0.2+)
- Conv2D
- MaxPool2D
- Dropout
- Embedding
- Attention (fused)
- RMSNorm