Ztorch Backends
Backend implementation details and architecture.
Overview
Ztorch supports multiple backends, each optimized for different hardware:
- CPU Scalar - Reference implementation; simple enough to be obviously correct
- CPU SIMD - Vectorized with AVX2/AVX512 (x86) or NEON (ARM)
- CUDA - NVIDIA GPUs via PTX assembly
- ROCm - AMD GPUs via LLVM IR
- Vulkan - Cross-vendor via SPIR-V
Backend Interface
All backends implement the same interface:
pub const Backend = struct {
    vtable: *const VTable,
    context: *anyopaque,

    pub const VTable = struct {
        // Lifecycle
        init: *const fn (Allocator) anyerror!*anyopaque,
        deinit: *const fn (*anyopaque) void,

        // Memory
        alloc: *const fn (*anyopaque, usize) anyerror!DevicePtr,
        free: *const fn (*anyopaque, DevicePtr) void,
        copy_to_device: *const fn (*anyopaque, []const u8, DevicePtr) anyerror!void,
        copy_from_device: *const fn (*anyopaque, DevicePtr, []u8) anyerror!void,

        // Operations
        matmul: *const fn (*anyopaque, Tensor, Tensor) anyerror!Tensor,
        relu: *const fn (*anyopaque, Tensor) anyerror!Tensor,
        softmax: *const fn (*anyopaque, Tensor, usize) anyerror!Tensor,
        // ... all ops
    };
};
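Call sites go through the vtable with the backend's opaque context pointer. A minimal sketch of the forwarding, using hypothetical free-function wrappers (the real convenience API may differ):

fn backendMatmul(backend: Backend, a: Tensor, b: Tensor) anyerror!Tensor {
    // Forward through the vtable, passing the backend's opaque context
    return backend.vtable.matmul(backend.context, a, b);
}

fn backendAlloc(backend: Backend, size: usize) anyerror!DevicePtr {
    return backend.vtable.alloc(backend.context, size);
}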
CPU Scalar Backend
Purpose: Reference implementation for correctness.
Characteristics:
- Simple, readable code
- No SIMD, no assembly
- Serves as ground truth for testing
Example: MatMul
fn matmul_cpu_scalar(a: Tensor, b: Tensor, c: *Tensor) void {
    const M = a.shape.dims[0];
    const K = a.shape.dims[1];
    const N = b.shape.dims[1];

    for (0..M) |i| {
        for (0..N) |j| {
            var sum: f32 = 0;
            for (0..K) |k| {
                sum += a.data[i * K + k] * b.data[k * N + j];
            }
            c.data[i * N + j] = sum;
        }
    }
}
Expected Performance:
- MatMul 1024x1024: ~5 GFLOPS (single core)
CPU SIMD Backend
Purpose: Optimized CPU implementation using vector instructions.
Implementation:
- x86: AVX2 (256-bit) or AVX-512 (512-bit) intrinsics
- ARM: NEON (128-bit) intrinsics
- Runtime detection of CPU capabilities
- Fallback to scalar if unsupported (dispatch sketch below)
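A minimal dispatch sketch. It gates on the compile target's feature set via std.Target; true runtime detection, as described above, would replace the comptime check with a CPUID probe. relu_cpu_scalar is the assumed scalar fallback (only its matmul counterpart is shown earlier):

const std = @import("std");
const builtin = @import("builtin");

fn relu(input: []f32, output: []f32) void {
    // Compile-target gate; a CPUID-based runtime probe would go here instead.
    if (comptime builtin.cpu.arch == .x86_64 and
        std.Target.x86.featureSetHas(builtin.cpu.features, .avx2))
    {
        relu_cpu_avx2(input, output);
    } else {
        relu_cpu_scalar(input, output); // assumed scalar fallback
    }
}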
Example: ReLU with AVX2
fn relu_cpu_avx2(input: []f32, output: []f32) void {
    const zero = @Vector(8, f32){ 0, 0, 0, 0, 0, 0, 0, 0 };
    var i: usize = 0;

    while (i + 8 <= input.len) : (i += 8) {
        const vec: @Vector(8, f32) = input[i..][0..8].*;
        const result = @max(vec, zero);
        output[i..][0..8].* = result;
    }

    // Handle remaining elements
    while (i < input.len) : (i += 1) {
        output[i] = @max(input[i], 0);
    }
}
Expected Performance:
- 4-8x speedup over scalar (depending on operation)
- MatMul 1024x1024: ~20-40 GFLOPS (single core)
CUDA Backend
Purpose: High-performance execution on NVIDIA GPUs.
Architecture:
- Generate PTX assembly at comptime
- Load via CUDA Driver API
- Use tensor cores when available
Code Generation:
fn generateMatMulPTX(
    comptime M: usize,
    comptime N: usize,
    comptime K: usize,
) [:0]const u8 {
    comptime {
        const tile_size = 32;
        var ptx: []const u8 = ".version 8.5\n";
        ptx = ptx ++ ".target sm_80\n"; // Ampere
        ptx = ptx ++ ".address_size 64\n\n";
        // Generate the tiled matmul kernel body (tile_size x tile_size tiles)
        // Use wmma.mma for tensor cores
        // ...
        return ptx;
    }
}
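The generated PTX is then loaded through the CUDA Driver API. A minimal sketch, with handle types reduced to opaque pointers, CUresult reduced to c_int, and context setup (cuInit, cuCtxCreate) omitted; the kernel entry name "matmul" is an assumption about what the generator emits:

const CUmodule = ?*anyopaque;
const CUfunction = ?*anyopaque;

// cuda.h takes `const void *image`; PTX text must be null-terminated.
extern fn cuModuleLoadData(module: *CUmodule, image: [*:0]const u8) c_int;
extern fn cuModuleGetFunction(func: *CUfunction, module: CUmodule, name: [*:0]const u8) c_int;

fn loadMatMulKernel() !CUfunction {
    const ptx = comptime generateMatMulPTX(1024, 1024, 1024);
    var module: CUmodule = null;
    if (cuModuleLoadData(&module, ptx.ptr) != 0) return error.CudaModuleLoad;
    var func: CUfunction = null;
    if (cuModuleGetFunction(&func, module, "matmul") != 0) return error.CudaGetFunction;
    return func;
}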
Optimization Techniques:
- Tiling for shared memory
- Tensor core usage (wmma instructions)
- Memory coalescing
- Bank conflict avoidance
Expected Performance:
- RTX 4090: 4000+ GFLOPS for MatMul
- Limited by memory bandwidth for smaller ops
Tensor Cores:
// Simplified wmma sequence (f16 inputs, f32 accumulate, row-major A, col-major B).
// %frag_* stand for the multi-register fragment groups the real instructions take;
// %lda/%ldb/%ldc are the leading dimensions (strides) in elements.
// Load tiles into matrix fragments
wmma.load.a.sync.aligned.row.m16n16k16.global.f16 %frag_a, [%ptr_a], %lda;
wmma.load.b.sync.aligned.col.m16n16k16.global.f16 %frag_b, [%ptr_b], %ldb;
// Multiply-accumulate: frag_c = frag_a * frag_b + frag_c
wmma.mma.sync.aligned.row.col.m16n16k16.f32.f32 %frag_c, %frag_a, %frag_b, %frag_c;
// Store the accumulator tile
wmma.store.d.sync.aligned.row.m16n16k16.global.f32 [%ptr_c], %frag_c, %ldc;
ROCm Backend
Purpose: Support for AMD GPUs.
Architecture:
- Generate LLVM IR at comptime
- Compile via ROCm HIP toolchain
- Similar optimization strategies to CUDA
Code Generation:
fn generateMatMulLLVM(
    comptime M: usize,
    comptime N: usize,
    comptime K: usize,
) [:0]const u8 {
    comptime {
        var llvm: []const u8 = "";
        // LLVM IR for tiled matmul
        llvm = llvm ++ "define void @matmul(...) {\n";
        // ...
        llvm = llvm ++ "}\n";
        return llvm;
    }
}
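The IR is compiled to a GCN code object (e.g. .hsaco) by the ROCm toolchain and loaded at runtime. A minimal sketch using the HIP module API, which mirrors the CUDA Driver API; the entry name "matmul" and the reduction of hipError_t to c_int are assumptions:

const hipModule_t = ?*anyopaque;
const hipFunction_t = ?*anyopaque;

extern fn hipModuleLoadData(module: *hipModule_t, image: [*]const u8) c_int;
extern fn hipModuleGetFunction(func: *hipFunction_t, module: hipModule_t, name: [*:0]const u8) c_int;

fn loadRocmKernel(code_object: []const u8) !hipFunction_t {
    var module: hipModule_t = null;
    if (hipModuleLoadData(&module, code_object.ptr) != 0) return error.HipModuleLoad;
    var func: hipFunction_t = null;
    if (hipModuleGetFunction(&func, module, "matmul") != 0) return error.HipGetFunction;
    return func;
}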
Expected Performance:
- Similar to CUDA (depends on GPU model)
- MI300X: 5000+ GFLOPS for MatMul
Vulkan Backend
Purpose: Cross-vendor GPU support (NVIDIA, AMD, Intel).
Architecture:
- Generate SPIR-V at comptime
- Use Vulkan compute shaders
- Portable across all GPUs
Trade-offs:
- More portable but less optimized than native backends
- No tensor core access (uses FP32 MAD operations)
- Good for inference, adequate for training
Code Generation:
fn generateMatMulSPIRV(
    comptime M: usize,
    comptime N: usize,
    comptime K: usize,
) []const u8 {
    comptime {
        // SPIR-V assembly
        var spirv: []const u8 = "";
        // OpTypeFloat, OpTypeVector, etc.
        // OpMatrixTimesVector or manual loop
        // ...
        return spirv;
    }
}
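The generated shader is dispatched as a compute workload; group counts follow from the output dimensions and the workgroup (local) size baked into the SPIR-V. A small sketch, assuming a 16x16 workgroup:

const WORKGROUP = 16; // assumed local_size_x / local_size_y in the generated SPIR-V

// Group counts for vkCmdDispatch over an M x N output matrix.
fn dispatchDims(m: u32, n: u32) [3]u32 {
    return .{
        (n + WORKGROUP - 1) / WORKGROUP, // x: output columns
        (m + WORKGROUP - 1) / WORKGROUP, // y: output rows
        1,
    };
}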
Expected Performance:
- 50-80% of native backend performance
- Good enough for most inference workloads
Backend Selection
Compile-time
const Model = ztorch.Sequential(.{ /* ... */ });
var model = try Model.compile(.cuda, allocator); // Fixed at compile time
Runtime
const backend: Device = if (cuda_available) .cuda else .cpu;
var model = try Model.compile(backend, allocator);
Auto-detection
const backend = try ztorch.selectBestBackend(); // Detects available hardware
var model = try Model.compile(backend, allocator);
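For reference, a minimal sketch of what the device tag and auto-detection could look like. The probe helpers are hypothetical stubs; the real selectBestBackend may probe hardware differently and distinguish scalar vs SIMD CPU internally:

pub const Device = enum { cpu, cuda, rocm, vulkan };

// Hypothetical probes; a real implementation might dlopen libcuda/libamdhip64
// or create a Vulkan instance to check for usable hardware.
fn cudaAvailable() bool { return false; }
fn rocmAvailable() bool { return false; }
fn vulkanAvailable() bool { return false; }

pub fn selectBestBackend() !Device {
    if (cudaAvailable()) return .cuda;
    if (rocmAvailable()) return .rocm;
    if (vulkanAvailable()) return .vulkan;
    return .cpu;
}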
Testing Backend Parity
All backends must agree with the CPU scalar reference to within a small floating-point tolerance.
test "matmul: cpu vs cuda parity" {
const input_a = try Tensor.randn(.{32, 64}, .f32, .cpu);
const input_b = try Tensor.randn(.{64, 128}, .f32, .cpu);
// CPU result
const output_cpu = try ops.matmul_cpu(input_a, input_b);
// CUDA result
const input_a_gpu = try input_a.to(.cuda);
const input_b_gpu = try input_b.to(.cuda);
const output_cuda = try ops.matmul_cuda(input_a_gpu, input_b_gpu);
const output_cuda_cpu = try output_cuda.to(.cpu);
// Compare
try testing.expectApproxEqSlice(
f32,
output_cpu.data,
output_cuda_cpu.data,
1e-4, // epsilon
);
}
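expectApproxEqSlice is not part of std.testing; a project-local helper along these lines (element-wise absolute tolerance) is one way it could be implemented:

const std = @import("std");

fn expectApproxEqSlice(comptime T: type, expected: []const T, actual: []const T, eps: T) !void {
    try std.testing.expectEqual(expected.len, actual.len);
    for (expected, actual) |e, a| {
        // Absolute tolerance per element; a relative check may suit large magnitudes better.
        try std.testing.expectApproxEqAbs(e, a, eps);
    }
}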
Performance Validation
Every backend implementation must meet minimum performance requirements.
Napkin-math targets:
- CPU Scalar: Baseline
- CPU SIMD: >2x scalar
- CUDA: >10x CPU for N>1024
- ROCm: Similar to CUDA
- Vulkan: >5x CPU
Example Validation:
bench "matmul: 1024x1024 performance check" {
const result = try benchMatMul(.cuda, 1024, 1024, 1024);
// RTX 4090 theoretical: 82 TFLOPS
// 2*1024^3 FLOPs = 2.1B FLOPs
// Minimum 50% efficiency: 41 TFLOPS = 51 µs
try testing.expect(result.elapsed_ns < 51_000);
}
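The nanosecond budget above can be derived rather than hard-coded. A small helper, assuming the 50%-of-peak target from the comment:

// FLOPs for an M x N x K matmul: one multiply and one add per inner-product term.
fn matmulFlops(m: u64, n: u64, k: u64) u64 {
    return 2 * m * n * k;
}

// Time budget in nanoseconds to sustain `target_tflops` over `flops` of work.
fn budgetNs(flops: u64, target_tflops: f64) u64 {
    const seconds = @as(f64, @floatFromInt(flops)) / (target_tflops * 1e12);
    return @intFromFloat(seconds * 1e9);
}

// budgetNs(matmulFlops(1024, 1024, 1024), 41.0) ≈ 52_000 ns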
Future Backends
- Metal (Apple Silicon) - v0.3
- WebGPU (browser) - v0.4
- CPU Multi-threaded - v0.2 (OpenMP-style)