# hpc_ai_opt Python Bindings
hpc_ai_opt exposes a small set of CUDA kernels through nanobind. The current Python API is intentionally thin:

- inputs and outputs are CUDA tensors supplied by the caller
- several kernels require explicit shape arguments
- the bindings mirror the C++ kernel layout through submodules
- wrappers perform basic argument validation before launching asynchronous CUDA work
## Build and import
```shell
cmake -S . -B build -DBUILD_PYTHON_BINDINGS=ON
cmake --build build
export PYTHONPATH="$(pwd)/build/python:${PYTHONPATH}"
python -c "import hpc_ai_opt; print(hpc_ai_opt.__doc__)"
```
## Current modules
`hpc_ai_opt.elementwise`

- `relu(input, output)`
- `sigmoid(input, output)`
- `transpose(input, output, rows, cols)`

`hpc_ai_opt.reduction`

- `softmax(input, output, batch, seq_len)`
- `layer_norm(input, gamma, beta, output, batch, hidden_size, eps)`
- `rms_norm(input, gamma, output, batch, hidden_size, eps)`

`hpc_ai_opt.gemm`

- `matmul(A, B, C, M, N, K, alpha, beta)`
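The `rms_norm` signature above suggests the usual RMSNorm computation over the `hidden_size` axis. As a point of reference, here is a small NumPy sketch of the *assumed* semantics (normalize each row by its root mean square, then scale by `gamma`); the actual kernel may differ in reduction order and precision:

```python
import numpy as np

def rms_norm_reference(x, gamma, eps=1e-5):
    # Assumed semantics: y = x / sqrt(mean(x**2) + eps) * gamma,
    # reduced over the last (hidden_size) axis. Not the actual kernel.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

x = np.array([[1.0, 2.0, 3.0, 4.0]], dtype=np.float32)
gamma = np.ones(4, dtype=np.float32)
y = rms_norm_reference(x, gamma)
```

With `gamma` set to ones and a small `eps`, each output row has a root mean square of approximately 1, which is a convenient sanity check against the CUDA kernel's output.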
## Minimal example
```python
import torch
import hpc_ai_opt as opt

x = torch.randn(1024, 1024, device="cuda", dtype=torch.float32)
y = torch.empty_like(x)
opt.elementwise.relu(x, y)
torch.testing.assert_close(y, torch.relu(x))
```
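The `gemm.matmul(A, B, C, M, N, K, alpha, beta)` signature reads like a BLAS-style GEMM, i.e. `C ← alpha * A @ B + beta * C`, with `C` serving as both input and output. A NumPy sketch of that assumed contract, useful for validating kernel output on small inputs (this is a reference model, not the binding itself):

```python
import numpy as np

def gemm_reference(A, B, C, alpha=1.0, beta=0.0):
    # Assumed BLAS-style contract: C <- alpha * A @ B + beta * C.
    # The real binding mutates C in place; this reference returns a new array.
    return alpha * (A @ B) + beta * C

A = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
B = np.eye(2, dtype=np.float32)
C = np.ones((2, 2), dtype=np.float32)
out = gemm_reference(A, B, C, alpha=2.0, beta=1.0)
# With B = I: out = 2*A + C
```

To compare against the CUDA path, the same `A`, `B`, and `C` can be uploaded as CUDA tensors and passed to `opt.gemm.matmul` with matching `M`, `N`, `K`, `alpha`, and `beta`.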
## Notes and current limitations
- The bindings target CUDA tensors rather than NumPy CPU arrays.
- Output tensors are allocated by the caller.
- The public Python surface currently includes elementwise, reduction, and GEMM bindings only. Higher-level wrappers such as `flash_attention` are not exposed in the current extension module.
- Argument validation happens in the binding layer, but execution remains asynchronous with respect to the host.
## Reference files
- `examples/python/basic_usage.py`
- `python/bindings/bindings.cpp`
- `python/CMakeLists.txt`