Ztorch Backends
Backend implementation details and architecture.
Overview
Ztorch supports multiple backends, each optimized for different hardware:
- CPU Scalar - Reference implementation; simple enough to be obviously correct
- CPU SIMD - Vectorized with AVX2/AVX512 (x86) or NEON (ARM)
- CUDA - NVIDIA GPUs via PTX assembly
- ROCm - AMD GPUs via LLVM IR
- Vulkan - Cross-vendor via SPIR-V
Backend Interface
All backends implement the same interface:
pub const Backend = struct {
    vtable: *const VTable,
    context: *anyopaque,

    pub const VTable = struct {
        // Lifecycle
        init: *const fn (Allocator) anyerror!*anyopaque,
        deinit: *const fn (*anyopaque) void,

        // Memory
        alloc: *const fn (*anyopaque, usize) anyerror!DevicePtr,
        free: *const fn (*anyopaque, DevicePtr) void,
        copy_to_device: *const fn (*anyopaque, []const u8, DevicePtr) anyerror!void,
        copy_from_device: *const fn (*anyopaque, DevicePtr, []u8) anyerror!void,

        // Operations
        matmul: *const fn (*anyopaque, Tensor, Tensor) anyerror!Tensor,
        relu: *const fn (*anyopaque, Tensor) anyerror!Tensor,
        softmax: *const fn (*anyopaque, Tensor, usize) anyerror!Tensor,
        // ... all ops
    };
};
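Call sites go through the vtable with the backend's opaque context pointer. A minimal sketch of the forwarding, using hypothetical free-function wrappers (the real convenience API may differ):

fn backendMatmul(backend: Backend, a: Tensor, b: Tensor) anyerror!Tensor {
    // Forward through the vtable, passing the backend's opaque context
    return backend.vtable.matmul(backend.context, a, b);
}

fn backendAlloc(backend: Backend, size: usize) anyerror!DevicePtr {
    return backend.vtable.alloc(backend.context, size);
}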
CPU Scalar Backend
Purpose: Reference implementation for correctness.
Characteristics:
- Simple, readable code
- No SIMD, no assembly
- Serves as ground truth for testing
Example: MatMul
fn matmul_cpu_scalar(a: Tensor, b: Tensor, c: *Tensor) void {
    const M = a.shape.dims[0];
    const K = a.shape.dims[1];
    const N = b.shape.dims[1];

    for (0..M) |i| {
        for (0..N) |j| {
            var sum: f32 = 0;
            for (0..K) |k| {
                sum += a.data[i * K + k] * b.data[k * N + j];
            }
            c.data[i * N + j] = sum;
        }
    }
}
Expected Performance:
- MatMul 1024x1024: ~5 GFLOPS (single core)
CPU SIMD Backend
Purpose: Optimized CPU implementation using vector instructions.
Implementation:
- x86: AVX2 (256-bit) or AVX-512 (512-bit) intrinsics
- ARM: NEON (128-bit) intrinsics
- Runtime detection of CPU capabilities
- Fallback to scalar if unsupported (dispatch sketch below)
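A minimal dispatch sketch. It gates on the compile target's feature set via std.Target; true runtime detection, as described above, would replace the comptime check with a CPUID probe. relu_cpu_scalar is the assumed scalar fallback (only its matmul counterpart is shown earlier):

const std = @import("std");
const builtin = @import("builtin");

fn relu(input: []f32, output: []f32) void {
    // Compile-target gate; a CPUID-based runtime probe would go here instead.
    if (comptime builtin.cpu.arch == .x86_64 and
        std.Target.x86.featureSetHas(builtin.cpu.features, .avx2))
    {
        relu_cpu_avx2(input, output);
    } else {
        relu_cpu_scalar(input, output); // assumed scalar fallback
    }
}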
Example: ReLU with AVX2
fn relu_cpu_avx2(input: []f32, output: []f32) void {
    const zero = @Vector(8, f32){ 0, 0, 0, 0, 0, 0, 0, 0 };
    var i: usize = 0;

    while (i + 8 <= input.len) : (i += 8) {
        const vec: @Vector(8, f32) = input[i..][0..8].*;
        const result = @max(vec, zero);
        output[i..][0..8].* = result;
    }

    // Handle remaining elements
    while (i < input.len) : (i += 1) {
        output[i] = @max(input[i], 0);
    }
}
Expected Performance:
- 4-8x speedup over scalar (depending on operation)
- MatMul 1024x1024: ~20-40 GFLOPS (single core)
CUDA Backend
Purpose: High-performance execution on NVIDIA GPUs.
Architecture:
- Generate PTX assembly at comptime
- Load via CUDA Driver API
- Use tensor cores when available
Code Generation:
fn generateMatMulPTX(
    comptime M: usize,
    comptime N: usize,
    comptime K: usize,
) [:0]const u8 {
    comptime {
        const tile_size = 32;
        var ptx: []const u8 = ".version 8.5\n";
        ptx = ptx ++ ".target sm_80\n"; // Ampere
        ptx = ptx ++ ".address_size 64\n\n";
        // Generate the tiled matmul kernel body (tile_size x tile_size tiles)
        // Use wmma.mma for tensor cores
        // ...
        return ptx;
    }
}
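The generated PTX is then loaded through the CUDA Driver API. A minimal sketch, with handle types reduced to opaque pointers, CUresult reduced to c_int, and context setup (cuInit, cuCtxCreate) omitted; the kernel entry name "matmul" is an assumption about what the generator emits:

const CUmodule = ?*anyopaque;
const CUfunction = ?*anyopaque;

// cuda.h takes `const void *image`; PTX text must be null-terminated.
extern fn cuModuleLoadData(module: *CUmodule, image: [*:0]const u8) c_int;
extern fn cuModuleGetFunction(func: *CUfunction, module: CUmodule, name: [*:0]const u8) c_int;

fn loadMatMulKernel() !CUfunction {
    const ptx = comptime generateMatMulPTX(1024, 1024, 1024);
    var module: CUmodule = null;
    if (cuModuleLoadData(&module, ptx.ptr) != 0) return error.CudaModuleLoad;
    var func: CUfunction = null;
    if (cuModuleGetFunction(&func, module, "matmul") != 0) return error.CudaGetFunction;
    return func;
}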
Optimization Techniques:
- Tiling for shared memory
- Tensor core usage (wmma instructions)
- Memory coalescing
- Bank conflict avoidance
Expected Performance:
- RTX 4090: 4000+ GFLOPS for MatMul
- Limited by memory bandwidth for smaller ops
Tensor Cores:
// Simplified wmma sequence (f16 inputs, f32 accumulate, row-major A, col-major B).
// %frag_* stand for the multi-register fragment groups the real instructions take;
// %lda/%ldb/%ldc are the leading dimensions (strides) in elements.
// Load tiles into matrix fragments
wmma.load.a.sync.aligned.row.m16n16k16.global.f16 %frag_a, [%ptr_a], %lda;
wmma.load.b.sync.aligned.col.m16n16k16.global.f16 %frag_b, [%ptr_b], %ldb;
// Multiply-accumulate: frag_c = frag_a * frag_b + frag_c
wmma.mma.sync.aligned.row.col.m16n16k16.f32.f32 %frag_c, %frag_a, %frag_b, %frag_c;
// Store the accumulator tile
wmma.store.d.sync.aligned.row.m16n16k16.global.f32 [%ptr_c], %frag_c, %ldc;
ROCm Backend
Purpose: Support for AMD GPUs.
Architecture:
- Generate LLVM IR at comptime
- Compile via ROCm HIP toolchain
- Similar optimization strategies to CUDA
Code Generation:
fn generateMatMulLLVM(
    comptime M: usize,
    comptime N: usize,
    comptime K: usize,
) [:0]const u8 {
    comptime {
        var llvm: []const u8 = "";
        // LLVM IR for tiled matmul
        llvm = llvm ++ "define void @matmul(...) {\n";
        // ...
        llvm = llvm ++ "}\n";
        return llvm;
    }
}
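The IR is compiled to a GCN code object (e.g. .hsaco) by the ROCm toolchain and loaded at runtime. A minimal sketch using the HIP module API, which mirrors the CUDA Driver API; the entry name "matmul" and the reduction of hipError_t to c_int are assumptions:

const hipModule_t = ?*anyopaque;
const hipFunction_t = ?*anyopaque;

extern fn hipModuleLoadData(module: *hipModule_t, image: [*]const u8) c_int;
extern fn hipModuleGetFunction(func: *hipFunction_t, module: hipModule_t, name: [*:0]const u8) c_int;

fn loadRocmKernel(code_object: []const u8) !hipFunction_t {
    var module: hipModule_t = null;
    if (hipModuleLoadData(&module, code_object.ptr) != 0) return error.HipModuleLoad;
    var func: hipFunction_t = null;
    if (hipModuleGetFunction(&func, module, "matmul") != 0) return error.HipGetFunction;
    return func;
}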
Expected Performance:
- Similar to CUDA (depends on GPU model)
- MI300X: 5000+ GFLOPS for MatMul
Vulkan Backend
Purpose: Cross-vendor GPU support (NVIDIA, AMD, Intel).
Architecture:
- Generate SPIR-V at comptime
- Use Vulkan compute shaders
- Portable across all GPUs
Trade-offs:
- More portable but less optimized than native backends
- No tensor core access (uses FP32 MAD operations)
- Good for inference, adequate for training
Code Generation:
fn generateMatMulSPIRV(
    comptime M: usize,
    comptime N: usize,
    comptime K: usize,
) []const u8 {
    comptime {
        // SPIR-V assembly
        var spirv: []const u8 = "";
        // OpTypeFloat, OpTypeVector, etc.
        // OpMatrixTimesVector or manual loop
        // ...
        return spirv;
    }
}
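The generated shader is dispatched as a compute workload; group counts follow from the output dimensions and the workgroup (local) size baked into the SPIR-V. A small sketch, assuming a 16x16 workgroup:

const WORKGROUP = 16; // assumed local_size_x / local_size_y in the generated SPIR-V

// Group counts for vkCmdDispatch over an M x N output matrix.
fn dispatchDims(m: u32, n: u32) [3]u32 {
    return .{
        (n + WORKGROUP - 1) / WORKGROUP, // x: output columns
        (m + WORKGROUP - 1) / WORKGROUP, // y: output rows
        1,
    };
}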
Expected Performance:
- 50-80% of native backend performance
- Good enough for most inference workloads
Backend Selection
Compile-time
const Model = ztorch.Sequential(.{ /* ... */ });
var model = try Model.compile(.cuda, allocator); // Fixed at compile time
Runtime
const backend: Device = if (cuda_available) .cuda else .cpu;
var model = try Model.compile(backend, allocator);
Auto-detection
const backend = try ztorch.selectBestBackend(); // Detects available hardware
var model = try Model.compile(backend, allocator);
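For reference, a minimal sketch of what the device tag and auto-detection could look like. The probe helpers are hypothetical stubs; the real selectBestBackend may probe hardware differently and distinguish scalar vs SIMD CPU internally:

pub const Device = enum { cpu, cuda, rocm, vulkan };

// Hypothetical probes; a real implementation might dlopen libcuda/libamdhip64
// or create a Vulkan instance to check for usable hardware.
fn cudaAvailable() bool { return false; }
fn rocmAvailable() bool { return false; }
fn vulkanAvailable() bool { return false; }

pub fn selectBestBackend() !Device {
    if (cudaAvailable()) return .cuda;
    if (rocmAvailable()) return .rocm;
    if (vulkanAvailable()) return .vulkan;
    return .cpu;
}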
Testing Backend Parity
All backends must agree with the CPU scalar reference to within a small floating-point tolerance.
test "matmul: cpu vs cuda parity" {
const input_a = try Tensor.randn(.{32, 64}, .f32, .cpu);
const input_b = try Tensor.randn(.{64, 128}, .f32, .cpu);
// CPU result
const output_cpu = try ops.matmul_cpu(input_a, input_b);
// CUDA result
const input_a_gpu = try input_a.to(.cuda);
const input_b_gpu = try input_b.to(.cuda);
const output_cuda = try ops.matmul_cuda(input_a_gpu, input_b_gpu);
const output_cuda_cpu = try output_cuda.to(.cpu);
// Compare
try testing.expectApproxEqSlice(
f32,
output_cpu.data,
output_cuda_cpu.data,
1e-4, // epsilon
);
}
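expectApproxEqSlice is not part of std.testing; a project-local helper along these lines (element-wise absolute tolerance) is one way it could be implemented:

const std = @import("std");

fn expectApproxEqSlice(comptime T: type, expected: []const T, actual: []const T, eps: T) !void {
    try std.testing.expectEqual(expected.len, actual.len);
    for (expected, actual) |e, a| {
        // Absolute tolerance per element; a relative check may suit large magnitudes better.
        try std.testing.expectApproxEqAbs(e, a, eps);
    }
}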
Performance Validation
Every backend implementation must meet minimum performance requirements.
Napkin-math targets:
- CPU Scalar: Baseline
- CPU SIMD: >2x scalar
- CUDA: >10x CPU for N>1024
- ROCm: Similar to CUDA
- Vulkan: >5x CPU
Example Validation:
bench "matmul: 1024x1024 performance check" {
const result = try benchMatMul(.cuda, 1024, 1024, 1024);
// RTX 4090 theoretical: 82 TFLOPS
// 2*1024^3 FLOPs = 2.1B FLOPs
// Minimum 50% efficiency: 41 TFLOPS = 51 µs
try testing.expect(result.elapsed_ns < 51_000);
}
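The nanosecond budget above can be derived rather than hard-coded. A small helper, assuming the 50%-of-peak target from the comment:

// FLOPs for an M x N x K matmul: one multiply and one add per inner-product term.
fn matmulFlops(m: u64, n: u64, k: u64) u64 {
    return 2 * m * n * k;
}

// Time budget in nanoseconds to sustain `target_tflops` over `flops` of work.
fn budgetNs(flops: u64, target_tflops: f64) u64 {
    const seconds = @as(f64, @floatFromInt(flops)) / (target_tflops * 1e12);
    return @intFromFloat(seconds * 1e9);
}

// budgetNs(matmulFlops(1024, 1024, 1024), 41.0) ≈ 52_000 ns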
Future Backends
- Metal (Apple Silicon) - v0.3
- WebGPU (browser) - v0.4
- CPU Multi-threaded - v0.2 (OpenMP-style)