Installation

Use this page to prepare a working environment and run the first validation steps.

Requirements

Area      Baseline                  Notes
Python    >=3.9                     The package metadata targets Python 3.9+
PyTorch   >=2.0.0                   CUDA build required for actual Triton kernel execution
Triton    >=2.1.0                   OpenAI Triton
GPU       CUDA-capable NVIDIA GPU   Needed for kernels and GPU benchmarks
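
The baselines above can be checked from Python before installing. The following is a minimal sketch, not part of the package: `parse_version` and `check_baselines` are hypothetical helpers, and it assumes the installed distributions are named `torch` and `triton`. It only reads package metadata, so it also runs on CPU-only machines.

```python
import sys
from importlib import metadata

# Minimum versions copied from the requirements table (assumed dist names).
BASELINES = {"torch": (2, 0, 0), "triton": (2, 1, 0)}


def parse_version(text):
    """Turn '2.1.0+cu121' into (2, 1, 0); local and pre-release suffixes are ignored."""
    parts = []
    for piece in text.split("+")[0].split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_baselines(baselines=BASELINES):
    """Return {name: (installed_version_or_None, meets_baseline)}."""
    report = {}
    for name, minimum in baselines.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, parse_version(installed) >= minimum)
    return report


print("python", sys.version.split()[0], "ok" if sys.version_info >= (3, 9) else "too old")
for name, (installed, ok) in check_baselines().items():
    print(name, installed or "not installed", "ok" if ok else "below baseline or missing")
```

The comparison is deliberately simple (numeric major, minor, patch only); a project that already depends on `packaging` may prefer its version parser instead.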

Install from source

git clone https://github.com/LessUp/triton-fused-ops.git
cd triton-fused-ops
pip install -e ".[dev]"

If you only need the package itself:

pip install -e .

If you use uv:

uv pip install -e ".[dev]"

CPU-safe baseline checks

These checks do not require running Triton kernels and are suitable for CI or CPU-only validation paths:

python -c "import triton_ops; print(triton_ops.__version__)"
ruff format --check .
ruff check .
mypy triton_ops/
pytest tests/ -v -k "not cuda and not gpu" --ignore=tests/benchmarks/
python -m build

GPU smoke test

import torch
from triton_ops import fused_rmsnorm_rope

assert torch.cuda.is_available(), "GPU smoke test requires a CUDA-capable GPU and a CUDA build of PyTorch"

batch, seq_len, hidden_dim, head_dim = 2, 128, 4096, 64
x = torch.randn(batch, seq_len, hidden_dim, device="cuda", dtype=torch.float16)
weight = torch.ones(hidden_dim, device="cuda", dtype=torch.float16)
cos = torch.randn(seq_len, head_dim, device="cuda", dtype=torch.float16)
sin = torch.randn(seq_len, head_dim, device="cuda", dtype=torch.float16)

y = fused_rmsnorm_rope(x, weight, cos, sin)
print(y.shape, y.dtype)

Notes:

  • The current implementation accepts cos and sin in shape [seq_len, head_dim].
  • Validation also accepts cached 4D RoPE tensors in shape [1, seq_len, 1, head_dim].
  • Runtime validation expects CUDA tensors, supported floating dtypes, and contiguous inputs.
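
The two RoPE cache layouts named in these notes can be written down as a small pre-flight check. This is a hedged sketch rather than the package's own validation code: `rope_cache_to_2d_shape` is a hypothetical helper, and it works on plain shape tuples so it runs without a GPU.

```python
def rope_cache_to_2d_shape(shape, seq_len, head_dim):
    """Map an accepted RoPE cache shape to its 2D form (seq_len, head_dim).

    Hypothetical helper for illustration; it mirrors the two layouts listed
    in the notes and is not part of triton_ops.
    """
    shape = tuple(shape)
    if shape == (seq_len, head_dim):
        return shape
    if shape == (1, seq_len, 1, head_dim):
        # Cached 4D layout: the two singleton axes carry no extra data.
        return (seq_len, head_dim)
    raise ValueError(
        f"expected [{seq_len}, {head_dim}] or [1, {seq_len}, 1, {head_dim}], "
        f"got {list(shape)}"
    )


# Shapes taken from the smoke test (seq_len=128, head_dim=64).
print(rope_cache_to_2d_shape((128, 64), 128, 64))
print(rope_cache_to_2d_shape((1, 128, 1, 64), 128, 64))
```

With real tensors, passing `tuple(cos.shape)` gives the same check before the kernel is called.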

Environment sanity check

import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA version:", torch.version.cuda)
    print("GPU:", torch.cuda.get_device_name())

Common problems

CUDA is not available

Typical causes:

  • PyTorch was installed without CUDA support.
  • The active Python environment cannot see the expected NVIDIA driver/runtime.
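
The two causes can usually be told apart from PyTorch itself, because `torch.version.cuda` is `None` on CPU-only builds. A minimal sketch, assuming nothing beyond PyTorch; `diagnose_cuda` is a hypothetical helper and the messages are illustrative:

```python
try:
    import torch
except ImportError:  # PyTorch itself is missing from this environment
    torch = None


def diagnose_cuda(torch_installed, cuda_build, cuda_available):
    """Classify the environment from three facts reported by PyTorch.

    cuda_build is the value of torch.version.cuda: a string such as "12.1"
    for CUDA builds and None for CPU-only builds.
    """
    if not torch_installed:
        return "PyTorch is not installed in this environment"
    if cuda_available:
        return "CUDA is available"
    if cuda_build is None:
        return "PyTorch was installed without CUDA support"
    return "CUDA build present, but no usable NVIDIA driver or device is visible"


if torch is None:
    print(diagnose_cuda(False, None, False))
else:
    print(diagnose_cuda(True, torch.version.cuda, torch.cuda.is_available()))
```

The last message points at the driver or runtime side, where reinstalling PyTorch alone is unlikely to help.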

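Typical recovery (reinstall a CUDA build of PyTorch; pick the wheel index for a CUDA version your driver supports, cu121 is shown as an example):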

pip install torch --index-url https://download.pytorch.org/whl/cu121

DeviceError on kernel calls

The exported kernels check that tensors are on CUDA before launching. Move all inputs to the same CUDA device and keep them contiguous.
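
One way to apply that advice is to normalize all inputs in a single place before calling a kernel. The sketch below is an assumption-laden illustration: `prepare_inputs` is a hypothetical helper, the default device `cuda:0` is an arbitrary choice, and the demo only runs when a CUDA build of PyTorch and a GPU are present.

```python
try:
    import torch
except ImportError:  # keeps the helper importable without PyTorch
    torch = None


def prepare_inputs(tensors, device="cuda:0"):
    """Return the tensors on `device` and contiguous.

    Tensor.to and Tensor.contiguous return the tensor unchanged when it
    already satisfies the requirement, so no copy is made in that case.
    """
    return [t.to(device).contiguous() for t in tensors]


if torch is not None and torch.cuda.is_available():
    x = torch.randn(2, 128, 4096, dtype=torch.float16)        # starts on CPU
    w = torch.ones(4096, device="cuda", dtype=torch.float16)  # already on GPU
    x, w = prepare_inputs([x, w])
    print(x.device, w.device, x.is_contiguous(), w.is_contiguous())
```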

UnsupportedDtypeError or shape validation failures

The API pages list the exact constraints. In summary:

  • fused_rmsnorm_rope: 3D x, 1D weight, 2D or 4D RoPE cache.
  • fused_gated_mlp: 3D x, 2D weights, activation in {"silu", "gelu"}.
  • fp8_gemm: 2D matrices; pre-quantized uint8 inputs require matching scale tensors.

Next