hpc_ai_opt Python Bindings

hpc_ai_opt exposes a small set of CUDA kernels through nanobind. The current Python API is intentionally thin:

  • inputs and outputs are CUDA tensors supplied by the caller

  • several kernels require explicit shape arguments

  • the bindings mirror the C++ kernel layout through submodules

  • wrappers perform basic argument validation before launching asynchronous CUDA work

Build and import

cmake -S . -B build -DBUILD_PYTHON_BINDINGS=ON
cmake --build build
export PYTHONPATH="$(pwd)/build/python:${PYTHONPATH}"
python -c "import hpc_ai_opt; print(hpc_ai_opt.__doc__)"

Current modules

  • hpc_ai_opt.elementwise
      - relu(input, output)
      - sigmoid(input, output)
      - transpose(input, output, rows, cols)

  • hpc_ai_opt.reduction
      - softmax(input, output, batch, seq_len)
      - layer_norm(input, gamma, beta, output, batch, hidden_size, eps)
      - rms_norm(input, gamma, output, batch, hidden_size, eps)

  • hpc_ai_opt.gemm
      - matmul(A, B, C, M, N, K, alpha, beta)
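The explicit shape arguments describe how the kernels interpret their inputs. The sketch below is a pure-Python reference for the intended semantics, inferred from the signatures above; it is documentation only, not part of the package, and the real kernels operate on CUDA tensors:

import math

def softmax_ref(x, batch, seq_len):
    # Reference: x holds batch rows of length seq_len (row-major);
    # softmax is applied independently to each row.
    out = []
    for b in range(batch):
        row = x[b * seq_len:(b + 1) * seq_len]
        m = max(row)                          # subtract max for stability
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.extend(e / s for e in exps)
    return out

def matmul_ref(A, B, C, M, N, K, alpha=1.0, beta=0.0):
    # Reference: C <- alpha * (A @ B) + beta * C,
    # with A of shape MxK, B of shape KxN, C of shape MxN.
    for i in range(M):
        for j in range(N):
            acc = sum(A[i][k] * B[k][j] for k in range(K))
            C[i][j] = alpha * acc + beta * C[i][j]
    return C

For example, matmul with alpha=1.0 and beta=0.0 is a plain matrix product, while beta=1.0 accumulates into the existing contents of C.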

Minimal example

import torch
import hpc_ai_opt as opt

x = torch.randn(1024, 1024, device="cuda", dtype=torch.float32)
y = torch.empty_like(x)

opt.elementwise.relu(x, y)
torch.testing.assert_close(y, torch.relu(x))

Notes and current limitations

  • The bindings operate on CUDA tensors; NumPy CPU arrays are not accepted.

  • Output tensors must be allocated by the caller before each call; the bindings write into them in place.

  • The public Python surface currently includes elementwise, reduction, and GEMM bindings only.

  • Higher-level wrappers such as flash_attention are not exposed in the current extension module.

  • Argument validation happens in the binding layer, but execution remains asynchronous with respect to the host.
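Because launches are asynchronous, a naive host-side timer around a call measures launch overhead rather than kernel runtime. A minimal sketch of correct timing, assuming the kernels run on the current CUDA stream (timed_launch is a hypothetical helper, not part of the package):

import time
import torch

def timed_launch(kernel, *args):
    # Synchronize before both timestamps so the measurement covers the
    # GPU work itself, not just the (asynchronous) launch.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    kernel(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.perf_counter() - t0

The same pattern applies before reading results back on the host: call torch.cuda.synchronize() (or copy through a PyTorch op, which orders on the stream) before inspecting output tensors.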

Reference files

  • examples/python/basic_usage.py

  • python/bindings/bindings.cpp

  • python/CMakeLists.txt