Performance Optimization Decision Tree
This guide helps you navigate the optimization process with a structured decision framework.
Overview
Performance optimization is not random—it follows a systematic approach. This decision tree helps you identify the right optimization technique for your specific situation.
The Golden Rule
Always profile before optimizing!
Premature optimization is the root of all evil (or at least most of it) in programming. — Donald Knuth
Decision Flowchart
Optimization Decision FlowMermaid
Quick Diagnosis Checklist
CPU-Bound Symptoms
| Indicator | Command | Threshold |
|---|---|---|
| High CPU utilization | perf stat | > 80% CPU busy |
| Low cache miss rate | perf stat -e cache-misses | < 1% LLC miss |
| Hot loops in profile | perf report | Single loop dominates |
Go to: SIMD Optimization
Memory-Bound Symptoms
| Indicator | Command | Threshold |
|---|---|---|
| High cache misses | perf stat -e cache-misses | > 5% LLC miss |
| Memory bandwidth saturated | perf stat -e bus-cycles | Near peak |
| Long memory latency | perf mem | High latency |
Go to: Memory Optimization
Concurrency Issues
| Indicator | Command | Detection |
|---|---|---|
| Poor thread scaling | OMP_NUM_THREADS varying | Speedup < 0.5 times threads |
| Lock contention | perf record -e lock:* | High lock wait time |
| Data races | cmake --preset=tsan | TSAN reports races |
Go to: Concurrency Optimization
Optimization Techniques by Category
SIMD Optimization
SIMD Optimization FlowMermaid
Key Techniques:
- Auto-vectorization - Let compiler do the work
- Alignment -
alignas(64)for SIMD-friendly data - Restrict pointers -
float* __restrict ptr - Manual intrinsics - SSE/AVX/AVX-512 when compiler fails
Reference: SIMD Module
Memory Optimization
Memory Optimization FlowMermaid
Key Techniques:
- AOS to SOA - Structure of Arrays for sequential access
- False sharing fix -
alignas(64)padding - Prefetching -
__builtin_prefetchfor predictable patterns - Cache blocking - Process data in cache-sized chunks
Reference: Memory Module
Concurrency Optimization
Concurrency Optimization FlowMermaid
Key Techniques:
- Atomic operations -
std::atomicwith correct memory ordering - Lock-free structures - SPSC queues, lock-free stacks
- False sharing prevention - Cache line padding
- OpenMP parallelization -
#pragma omp parallel for
Reference: Concurrency Module
Decision Quick Reference
| Your Situation | First Try | Alternative |
|---|---|---|
| Loop runs many iterations | Auto-vectorization | Manual SIMD |
| Sequential data access, high cache miss | AOS to SOA | Prefetching |
| Multi-threaded, poor scaling | Check false sharing | Lock-free |
| Random memory access | Blocking/tiling | B-tree structure |
| High branch misprediction | Branchless code | Sort data |
| Small critical section | Lock-free | Fine-grained locks |
Profiling Workflow
Profiling WorkflowMermaid
Common Pitfalls
Premature Optimization
cpp
// BAD: Optimizing without profiling
void process() {
// "This loop looks slow, let me SIMD it"
for (int i = 0; i < 10; ++i) { ... } // Only 10 iterations!
}Fix: Profile first. Small loops may not matter.
Wrong Optimization for the Problem
cpp
// BAD: Using SIMD for memory-bound code
void process(float* data, size_t n) {
// Memory bandwidth saturated, SIMD won't help!
for (size_t i = 0; i < n; ++i) {
data[i] = data[i] * 2.0f; // Memory-bound, not CPU-bound
}
}Fix: Check if CPU or memory bound first.
Ignoring Algorithmic Complexity
cpp
// BAD: Optimizing O(n^2) algorithm
for (int i = 0; i < n; ++i) {
for (int j = 0; j < n; ++j) { // O(n^2)
// SIMD won't save you here
}
}Fix: Use better algorithm first (O(n log n) or O(n)).
Next Steps
- Profile your code using the Profiling Guide
- Identify the bottleneck using the decision tree above
- Apply the appropriate technique from the relevant module
- Measure the improvement with benchmarks
- Document your findings for future reference
Related Resources
- Learning Path - Structured curriculum
- Profiling Guide - How to profile
- Best Practices - Industry patterns
- FAQ - Common questions