hpc_ai_opt Python Bindings
==========================

``hpc_ai_opt`` exposes a small set of CUDA kernels through nanobind. The
current Python API is intentionally thin:

- inputs and outputs are CUDA tensors supplied by the caller
- several kernels require explicit shape arguments
- the bindings mirror the C++ kernel layout through submodules
- wrappers perform basic argument validation before launching
  asynchronous CUDA work

Build and import
----------------

.. code-block:: bash

   cmake -S . -B build -DBUILD_PYTHON_BINDINGS=ON
   cmake --build build
   export PYTHONPATH="$(pwd)/build/python:${PYTHONPATH}"
   python -c "import hpc_ai_opt; print(hpc_ai_opt.__doc__)"

Current modules
---------------

- ``hpc_ai_opt.elementwise``

  - ``relu(input, output)``
  - ``sigmoid(input, output)``
  - ``transpose(input, output, rows, cols)``

- ``hpc_ai_opt.reduction``

  - ``softmax(input, output, batch, seq_len)``
  - ``layer_norm(input, gamma, beta, output, batch, hidden_size, eps)``
  - ``rms_norm(input, gamma, output, batch, hidden_size, eps)``

- ``hpc_ai_opt.gemm``

  - ``matmul(A, B, C, M, N, K, alpha, beta)``

Minimal example
---------------

.. code-block:: python

   import torch
   import hpc_ai_opt as opt

   x = torch.randn(1024, 1024, device="cuda", dtype=torch.float32)
   y = torch.empty_like(x)

   opt.elementwise.relu(x, y)
   torch.testing.assert_close(y, torch.relu(x))

Notes and current limitations
-----------------------------

- The bindings target CUDA tensors rather than NumPy CPU arrays.
- Output tensors are allocated by the caller.
- The public Python surface currently includes elementwise, reduction,
  and GEMM bindings only.
- Higher-level wrappers such as ``flash_attention`` are not exposed in
  the current extension module.
- Argument validation happens in the binding layer, but execution
  remains asynchronous with respect to the host.

Reference files
---------------

- ``examples/python/basic_usage.py``
- ``python/bindings/bindings.cpp``
- ``python/CMakeLists.txt``
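GEMM semantics (reference sketch)
---------------------------------

The ``matmul`` binding takes explicit ``M``/``N``/``K`` dimensions and
``alpha``/``beta`` scaling factors. Assuming it follows the conventional
BLAS-style contract with row-major tensors (an assumption; check
``python/bindings/bindings.cpp`` for the actual layout), it computes
``C = alpha * A @ B + beta * C`` with ``C`` updated in place. A minimal
NumPy sketch of that assumed contract:

.. code-block:: python

   import numpy as np

   def matmul_reference(A, B, C, M, N, K, alpha=1.0, beta=0.0):
       """CPU sketch of the assumed GEMM contract: C = alpha*A@B + beta*C.

       A is (M, K), B is (K, N), C is (M, N). C is overwritten in place,
       matching the binding's caller-allocated-output convention.
       """
       assert A.shape == (M, K) and B.shape == (K, N) and C.shape == (M, N)
       C[:] = alpha * (A @ B) + beta * C  # RHS reads the old C before writing
       return C

   A = np.ones((2, 3), dtype=np.float32)
   B = np.ones((3, 4), dtype=np.float32)
   C = np.zeros((2, 4), dtype=np.float32)
   matmul_reference(A, B, C, M=2, N=4, K=3)

Because the CUDA launch is asynchronous, comparing the binding's output
against a reference like this requires a host synchronization (for
example ``torch.cuda.synchronize()``) first.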
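Softmax semantics (reference sketch)
------------------------------------

``reduction.softmax(input, output, batch, seq_len)`` takes explicit shape
arguments, which suggests the kernel treats the input as a flattened
``(batch, seq_len)`` matrix and normalizes each row. That row-wise
interpretation is an assumption; a numerically stable NumPy sketch of it:

.. code-block:: python

   import numpy as np

   def softmax_reference(x, batch, seq_len):
       """Row-wise softmax over seq_len, the assumed semantics of
       reduction.softmax(input, output, batch, seq_len)."""
       x = x.reshape(batch, seq_len)
       shifted = x - x.max(axis=1, keepdims=True)  # subtract row max for stability
       e = np.exp(shifted)
       return e / e.sum(axis=1, keepdims=True)

   y = softmax_reference(np.zeros(8, dtype=np.float32), batch=2, seq_len=4)

Each output row sums to one; a uniform input row yields a uniform
distribution.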
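RMSNorm semantics (reference sketch)
------------------------------------

``reduction.rms_norm(input, gamma, output, batch, hidden_size, eps)``
presumably implements the standard RMSNorm: scale each row by the inverse
root-mean-square of its elements, then apply the ``gamma`` gain. The exact
placement of ``eps`` (inside the square root here) is an assumption; a
NumPy sketch:

.. code-block:: python

   import numpy as np

   def rms_norm_reference(x, gamma, batch, hidden_size, eps=1e-6):
       """Assumed semantics of reduction.rms_norm: per-row RMS
       normalization over hidden_size, followed by a gamma gain."""
       x = x.reshape(batch, hidden_size)
       rms = np.sqrt((x * x).mean(axis=1, keepdims=True) + eps)
       return (x / rms) * gamma

   x = np.ones((1, 4), dtype=np.float32)
   gamma = np.ones(4, dtype=np.float32)
   y = rms_norm_reference(x, gamma, batch=1, hidden_size=4)

For an all-ones row the RMS is (up to ``eps``) one, so the output is
approximately the input; ``layer_norm`` differs by also subtracting the
row mean and adding a ``beta`` shift.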