CUDA Memory Cheat Sheet
This quick sheet is now part of the Resources Hub. Use it when you need to re-orient yourself before reopening kernel code, profiler output, or WMMA constraints.
When this page is most useful
- Right before reading Memory Flow or Tiled Kernel again.
- When a benchmark changed and you need a fast checklist before blaming occupancy or Tensor Cores.
- When you want to explain memory behavior in an interview without reopening the full CUDA manuals.
Coalescing quick rules
- Consecutive threads in a warp should touch consecutive addresses whenever possible.
- Accessing `B[k * N + col]` with large `N` can become stride-heavy when neighboring threads differ in `k` rather than `col`, because their addresses are then `N` elements apart.
- Tiling is not only about reuse; it also reshapes access into more coalesced loads.
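
To make the stride effect in `B[k * N + col]` concrete, here is a small host-side model. It is a sketch in plain C++, not a kernel and not a profiler measurement. It assumes 4-byte `float` elements, a 32-thread warp, and 128-byte memory segments. The name `segments_touched` is made up for illustration. Thread `t` reads element `base_elem + t * stride_elems`, so a stride of 1 models neighboring threads that vary `col`, and a stride of `N` models neighboring threads that vary `k`.

```cpp
#include <cstddef>
#include <set>

// Assumptions of this sketch: 32 threads per warp, 128-byte segments.
constexpr int kWarpSize = 32;
constexpr std::size_t kSegmentBytes = 128;

// Counts the distinct 128-byte segments one warp touches when thread t
// loads the float at element index base_elem + t * stride_elems.
// Fewer segments means the warp's loads are closer to fully coalesced.
std::size_t segments_touched(std::size_t base_elem, std::size_t stride_elems) {
    std::set<std::size_t> segments;
    for (int t = 0; t < kWarpSize; ++t) {
        std::size_t byte_addr = (base_elem + t * stride_elems) * sizeof(float);
        segments.insert(byte_addr / kSegmentBytes);
    }
    return segments.size();
}
```

Under these assumptions, `segments_touched(0, 1)` covers a single segment, while `segments_touched(0, 1024)` spreads the same 32 loads over 32 segments. A misaligned start such as `segments_touched(16, 1)` straddles two segments even though the stride is 1.
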
Shared-memory watchpoints
| Question | Why it matters |
|---|---|
| Are threads writing a tile layout that later reads back contiguously? | Shared memory only helps when it fixes a global-memory access problem instead of creating a local one. |
| Did padding or index remapping remove bank conflicts on the hot path? | Bank conflicts can erase the benefit of otherwise-good tiling choices. |
| Did the extra shared-memory footprint change occupancy enough to matter? | Some tiling wins disappear if the launch geometry becomes too constrained. |
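
The padding and footprint questions in the table can be estimated on the host before touching the kernel. The sketch below is an arithmetic model only. It assumes 32 shared-memory banks that are each one 4-byte word wide, a `float` tile, and a column read in which thread `t` loads `tile[t][col]`. The function names and the shared-memory budget parameter are hypothetical, and the block-count bound ignores every limit other than shared memory.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

constexpr int kWarpSize = 32;
constexpr int kNumBanks = 32;  // assumption: 32 banks, 4-byte words

// Worst-case number of threads in one warp that land in the same bank
// when thread t reads tile[t][col] and each row is row_pitch floats wide.
// A result of 1 means the access is conflict-free.
int max_bank_conflict_degree(int row_pitch, int col) {
    std::array<int, kNumBanks> hits{};
    for (int t = 0; t < kWarpSize; ++t) {
        int word_index = t * row_pitch + col;
        ++hits[word_index % kNumBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}

// Shared-memory footprint of one float tile, including any padding columns.
std::size_t tile_bytes(int rows, int row_pitch) {
    return static_cast<std::size_t>(rows) * row_pitch * sizeof(float);
}

// Upper bound on resident blocks per SM from shared memory alone,
// for a caller-supplied budget (one tile per block is assumed).
std::size_t max_blocks_by_smem(std::size_t smem_budget_bytes,
                               int rows, int row_pitch) {
    return smem_budget_bytes / tile_bytes(rows, row_pitch);
}
```

In this model a 32-float pitch gives a 32-way conflict on column reads and a 33-float pitch gives none. The padded 32-row tile grows from 4096 to 4224 bytes, which lowers the shared-memory-only bound from 12 to 11 blocks for an assumed 48 KiB budget.
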
Tensor Core memory notes
| Topic | What to remember |
|---|---|
| Alignment constraints | WMMA paths expect matrix dimensions to be multiples of the fragment size, typically 16 (for example, the 16x16x16 half-precision fragment shape). |
| Data conversion | End-to-end timing includes conversion and wrapper logic, not just the fused matrix multiply. |
| Safe behavior | Shapes that are not fragment-aligned should fall back to the FP32 path instead of forcing a misleading WMMA result. |
| Reporting | Distinguish end-to-end numbers from compute-only numbers before comparing implementations. |
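
The alignment, fallback, and reporting rows can be expressed as a small host-side dispatch helper. This is a minimal sketch rather than any particular project's wrapper: it assumes a fragment dimension of 16 for all three of `m`, `n`, and `k`, and every name in it (`GemmPath`, `choose_path`, `benchmark_label`) is hypothetical.

```cpp
#include <string>

// Hypothetical path tags for an explicit, inspectable dispatch decision.
enum class GemmPath { Wmma, Fp32Fallback };

constexpr int kFragmentDim = 16;  // assumption: 16x16x16 fragments

// True only when every dimension is a positive multiple of the fragment size.
constexpr bool is_wmma_friendly(int m, int n, int k) {
    return m > 0 && n > 0 && k > 0 &&
           m % kFragmentDim == 0 &&
           n % kFragmentDim == 0 &&
           k % kFragmentDim == 0;
}

// Shapes that fail the check take the FP32 path instead of a forced WMMA run.
constexpr GemmPath choose_path(int m, int n, int k) {
    return is_wmma_friendly(m, n, k) ? GemmPath::Wmma : GemmPath::Fp32Fallback;
}

// Builds a benchmark label that records both the path actually taken and
// whether the timed region included conversion and wrapper work.
std::string benchmark_label(GemmPath path, bool includes_conversion) {
    std::string label = (path == GemmPath::Wmma) ? "wmma" : "fp32";
    label += includes_conversion ? "/end-to-end" : "/compute-only";
    return label;
}
```

Deriving the label from the chosen path, rather than from the requested one, keeps a 100x128x128 run from being reported as a WMMA number when it actually executed the fallback.
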
Fast checklist when reading a kernel
- Can I explain the global-memory access order for one warp?
- Does the shared-memory layout reduce bank conflicts rather than merely move data around?
- Are register accumulators bounded and intentional?
- Is Tensor Core fallback behavior explicit?
- Do benchmark labels match what the kernel path really measures?