Installation
Use this page to prepare a working environment and run the first validation steps.
Requirements
| Area | Baseline | Notes |
|---|---|---|
| Python | >=3.9 | The package metadata targets Python 3.9+ |
| PyTorch | >=2.0.0 | CUDA build required for actual Triton kernel execution |
| Triton | >=2.1.0 | OpenAI Triton |
| GPU | CUDA-capable NVIDIA GPU | Needed for kernels and GPU benchmarks |
Install from source
git clone https://github.com/LessUp/triton-fused-ops.git
cd triton-fused-ops
pip install -e ".[dev]"
If you only need the package itself:
pip install -e .
If you use uv:
uv pip install -e ".[dev]"
CPU-safe baseline checks
These checks do not require running Triton kernels and are suitable for CI or CPU-only validation paths:
python -c "import triton_ops; print(triton_ops.__version__)"
ruff format --check .
ruff check .
mypy triton_ops/
pytest tests/ -v -k "not cuda and not gpu" --ignore=tests/benchmarks/
python3 -m build
GPU smoke test
import torch
from triton_ops import fused_rmsnorm_rope
assert torch.cuda.is_available()
batch, seq_len, hidden_dim, head_dim = 2, 128, 4096, 64
x = torch.randn(batch, seq_len, hidden_dim, device="cuda", dtype=torch.float16)
weight = torch.ones(hidden_dim, device="cuda", dtype=torch.float16)
cos = torch.randn(seq_len, head_dim, device="cuda", dtype=torch.float16)
sin = torch.randn(seq_len, head_dim, device="cuda", dtype=torch.float16)
y = fused_rmsnorm_rope(x, weight, cos, sin)
print(y.shape, y.dtype)
Notes:
- The current implementation accepts
cosandsinin shape[seq_len, head_dim]. - Validation also accepts cached 4D RoPE tensors in shape
[1, seq_len, 1, head_dim]. - Runtime validation expects CUDA tensors, supported floating dtypes, and contiguous inputs.
Environment sanity check
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
print("CUDA version:", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name())
Common problems
CUDA is not available
Typical causes:
- PyTorch was installed without CUDA support.
- The active Python environment cannot see the expected NVIDIA driver/runtime.
Typical recovery:
pip install torch --index-url https://download.pytorch.org/whl/cu121
DeviceError on kernel calls
The exported kernels check that tensors are on CUDA before launching. Move all inputs to the same CUDA device and keep them contiguous.
UnsupportedDtypeError or shape validation failures
Use the API pages for the exact constraints:
fused_rmsnorm_rope: 3Dx, 1Dweight, 2D or 4D RoPE cache.fused_gated_mlp: 3Dx, 2D weights, activation in{"silu", "gelu"}.fp8_gemm: 2D matrices; pre-quantizeduint8inputs require matching scale tensors.