Profiling Guide
This guide covers essential profiling tools and techniques for analyzing C++ performance.
Optimization Workflow
mermaid
flowchart TD
A[Identify Performance Issue] --> B[Profile with perf]
B --> C[Generate FlameGraph]
C --> D[Identify Hot Functions]
D --> E{CPU or Memory bound?}
E -->|CPU| F{Vectorizable?}
E -->|Memory| G{High cache misses?}
F -->|Yes| H[SIMD Optimization]
F -->|No| I[Algorithm Change]
G -->|Yes| J[Data Layout SOA]
G -->|No| K[Prefetching]
H --> L[Implement Fix]
I --> L
J --> L
K --> L
L --> M[Re-run Benchmark]
M --> N{Faster?}
N -->|Yes| O[Document & Commit]
N -->|No| P[Try Different Approach]
P --> B
style A fill:#ff6b6b
style O fill:#6bcb77
style N fill:#ffd93dOverview
Performance optimization follows a simple cycle:
- Measure - Profile to find bottlenecks
- Analyze - Understand the root cause
- Optimize - Apply targeted improvements
- Verify - Measure again to confirm improvement
Tools
perf (Linux)
perf is the standard Linux profiling tool.
Profiling Workflow with perf
mermaid
sequenceDiagram
participant Dev as Developer
participant Perf as perf
participant App as Application
participant FG as FlameGraph
Dev->>Perf: perf record -g ./benchmark
Perf->>App: Execute with sampling
App-->>Perf: Profile data (perf.data)
Dev->>Perf: perf script
Perf-->>Dev: Call stacks
Dev->>FG: stackcollapse + flamegraph.pl
FG-->>Dev: flamegraph.svg
Dev->>Dev: Identify hotspotsInstallation
bash
# Ubuntu/Debian
sudo apt-get install linux-tools-common linux-tools-generic
# Fedora
sudo dnf install perfBasic Usage
bash
# Record CPU samples
perf record -g ./your_benchmark
# View report
perf report
# Show annotated source
perf annotateUseful Commands
bash
# CPU cycles breakdown
perf stat ./your_benchmark
# Cache miss analysis
perf stat -e cache-references,cache-misses,L1-dcache-load-misses ./your_benchmark
# Branch prediction
perf stat -e branches,branch-misses ./your_benchmark
# Record with call graph (dwarf for C++)
perf record -g --call-graph dwarf ./your_benchmarkFlameGraph
FlameGraphs provide intuitive visualization of where time is spent.
Using the Project Script
bash
# Generate FlameGraph for a benchmark
./tools/performance/generate_flamegraph.sh ./build/release/examples/02-memory-cache/bench/aos_soa_bench
# View the result
firefox flamegraph.svgManual Generation
bash
# Clone FlameGraph tools (if not already done)
git clone https://github.com/brendangregg/FlameGraph.git
# Record with perf
perf record -F 99 -g ./your_benchmark
# Generate FlameGraph
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > flamegraph.svgReading FlameGraphs
- Width = Time spent (wider = more time)
- Height = Call stack depth
- Color = Random (no meaning)
- Top = Currently executing function
- Bottom = Entry point (main)
Look for:
- Wide plateaus (hot functions)
- Deep stacks (excessive call depth)
- Unexpected functions taking time
Valgrind
Valgrind provides detailed memory and cache analysis.
Cachegrind (Cache Simulation)
bash
# Run cache simulation
valgrind --tool=cachegrind ./your_benchmark
# View results
cg_annotate cachegrind.out.*Output shows:
- I1 cache misses (instruction cache)
- D1 cache misses (L1 data cache)
- LL cache misses (last-level cache)
Callgrind (Call Graph Profiling)
bash
# Run call graph profiling
valgrind --tool=callgrind ./your_benchmark
# View with KCachegrind (GUI)
kcachegrind callgrind.out.*Intel VTune (Advanced)
VTune provides the most detailed analysis on Intel CPUs.
Installation
Download from Intel oneAPI.
Basic Usage
bash
# Hotspots analysis
vtune -collect hotspots ./your_benchmark
# Memory access analysis
vtune -collect memory-access ./your_benchmark
# Microarchitecture analysis
vtune -collect uarch-exploration ./your_benchmark
# View results
vtune-guiProfiling Strategies
CPU-Bound Code
- Start with
perf statfor overview - Use
perf record+ FlameGraph to find hot functions - Use
perf annotateto see hot instructions - Check vectorization with compiler reports
bash
# Check if code is vectorized
g++ -O3 -march=native -fopt-info-vec-optimized your_code.cppMemory-Bound Code
- Check cache misses with
perf stat - Use Cachegrind for detailed cache analysis
- Look for:
- High L1 miss rate (> 5%)
- High LLC miss rate (> 1%)
- Poor spatial locality
bash
# Quick cache check
perf stat -e L1-dcache-load-misses,L1-dcache-loads ./your_benchmarkMulti-threaded Code
- Check for false sharing
- Analyze lock contention
- Verify thread scaling
bash
# Check for cache line bouncing (false sharing indicator)
perf stat -e cache-misses ./your_benchmark
# Run with different thread counts
OMP_NUM_THREADS=1 ./your_benchmark
OMP_NUM_THREADS=2 ./your_benchmark
OMP_NUM_THREADS=4 ./your_benchmarkCommon Performance Issues
1. Cache Misses
Symptoms:
- High L1/L2/L3 miss rates
- Memory bandwidth saturation
Solutions:
- Improve data locality (SOA layout)
- Use prefetching
- Reduce working set size
2. Branch Mispredictions
Symptoms:
- High branch-misses count
- Unpredictable control flow
Solutions:
- Use branchless code
- Sort data to improve prediction
- Use CMOV instructions
3. False Sharing
Symptoms:
- Poor multi-threaded scaling
- High cache-to-cache transfers
Solutions:
- Pad data to cache line boundaries
- Use thread-local storage
- Reduce shared state
4. Vectorization Failures
Symptoms:
- Scalar code in hot loops
- No SIMD instructions in assembly
Solutions:
- Align data
- Use
restrictpointers - Simplify loop structure
- Use explicit SIMD intrinsics
Benchmark Best Practices
Avoid Measurement Errors
cpp
// Prevent dead code elimination
benchmark::DoNotOptimize(result);
// Force memory writes to be visible
benchmark::ClobberMemory();Warm Up Caches
cpp
// Run a few iterations before measuring
for (int i = 0; i < warmup_iterations; ++i) {
do_work();
}Control Environment
bash
# Disable CPU frequency scaling
sudo cpupower frequency-set --governor performance
# Pin to specific CPU
taskset -c 0 ./your_benchmark
# Disable ASLR for reproducibility
echo 0 | sudo tee /proc/sys/kernel/randomize_va_spaceStatistical Significance
- Run multiple iterations
- Report mean, median, and standard deviation
- Use Google Benchmark's built-in statistics
bash
# Run with statistics
./your_benchmark --benchmark_repetitions=10 --benchmark_report_aggregates_only=trueQuick Reference
| Task | Tool | Command |
|---|---|---|
| CPU hotspots | perf | perf record -g ./bench && perf report |
| Cache misses | perf | perf stat -e cache-misses ./bench |
| Visual profile | FlameGraph | ./tools/performance/generate_flamegraph.sh ./bench |
| Detailed cache | Valgrind | valgrind --tool=cachegrind ./bench |
| Call graph | Valgrind | valgrind --tool=callgrind ./bench |
| Vectorization | GCC | -fopt-info-vec-optimized |
| Vectorization | Clang | -Rpass=loop-vectorize |
See Also
- Learning Path - Follow the curriculum
- Best Practices - Optimization patterns
- Troubleshooting - Common issues