# Memory Optimization Exercises

Practice cache optimization, data layout, and memory alignment techniques.
## Exercise 1: AOS to SOA Conversion

### Objective

Convert an Array of Structures (AOS) layout to Structure of Arrays (SOA) and measure the performance difference.

### Setup

The particle system example in `examples/02-memory-cache/src/aos_vs_soa.cpp` demonstrates this. You will modify it.

### Tasks
1. Study the existing code:

   ```bash
   cat examples/02-memory-cache/src/aos_vs_soa.cpp
   ```

2. Add a new field to the particle system (e.g., `mass`).
3. Update both AOS and SOA structures.
4. Benchmark the change:

   ```bash
   ./build/release/examples/02-memory-cache/bench/aos_soa_bench
   ```
### Questions

**Q1: How does adding a field affect AOS vs SOA performance?**

Adding a field to AOS increases the stride between same-field accesses, worsening cache utilization for field-wise operations. SOA is unaffected because each field array is independent.
**Q2: When would AOS be better than SOA?**

AOS is better when you access all fields together (e.g., `particle.x`, `particle.y`, `particle.z` in sequence). This is common in object-oriented designs or when updating all particle properties at once.
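Both answers can be sketched in a few lines. The field names (`x`, `y`, `z`, `mass`) are assumed here and may not match the repository's actual particle struct:

```cpp
#include <vector>

// AOS: adding `mass` widens every element, increasing the stride between
// consecutive `x` values from 12 to 16 bytes.
struct ParticleAOS {
    float x, y, z;
    float mass;  // new field
};

// SOA: `mass` lives in its own array; loops over x/y/z are unaffected.
struct ParticlesSOA {
    std::vector<float> x, y, z;
    std::vector<float> mass;  // new field
};

// Field-wise pass: in AOS form this touches only part of each cache line.
float sum_x_aos(const std::vector<ParticleAOS>& ps) {
    float s = 0.0f;
    for (const auto& p : ps) s += p.x;
    return s;
}

float sum_x_soa(const ParticlesSOA& ps) {
    float s = 0.0f;
    for (float v : ps.x) s += v;
    return s;
}
```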
## Exercise 2: Fix False Sharing

### Objective

Identify and fix false sharing in a multi-threaded counter.

### Setup
```cpp
// File: false_sharing_exercise.cpp
#include <vector>
#include <thread>
#include <atomic>
#include <iostream>

void increment_counters(std::atomic<int>* counters, int thread_id, int iterations) {
    for (int i = 0; i < iterations; ++i) {
        counters[thread_id].fetch_add(1, std::memory_order_relaxed);
    }
}

int main() {
    const int num_threads = 4;
    const int iterations = 10000000;

    // TODO: This has false sharing!
    std::atomic<int> counters[num_threads];

    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i) {
        counters[i] = 0;
        threads.emplace_back(increment_counters, counters, i, iterations);
    }
    for (auto& t : threads) {
        t.join();
    }
    return 0;
}
```

### Tasks
1. Compile and run the code above:

   ```bash
   g++ -O3 -pthread false_sharing_exercise.cpp -o false_sharing_exercise
   time ./false_sharing_exercise
   ```

2. Fix the false sharing using `alignas(64)` or padding.
3. Compare performance.
### Hint

The problem is that `std::atomic<int>` is typically 4 bytes, so all 4 counters fit in a single cache line. Use alignment to ensure each counter is on its own cache line:
```cpp
struct alignas(64) AlignedCounter {
    std::atomic<int> value{0};
};

AlignedCounter counters[num_threads];
```

### Questions
**Q: What speedup did you achieve?**

Typical speedup is 2-10x depending on the CPU and number of threads. More threads means more contention, and therefore a bigger speedup from fixing false sharing.
## Exercise 3: Prefetch Distance Tuning

### Objective

Find the optimal prefetch distance for different access patterns.

### Setup

Use the prefetch example:

```bash
cat examples/02-memory-cache/src/prefetch.cpp
```

### Tasks
1. Run the benchmark with default settings:

   ```bash
   ./build/release/examples/02-memory-cache/bench/prefetch_bench
   ```

2. Modify the prefetch distance (try 8, 16, 32, 64, 128).
3. Plot the results.
### Questions

**Q: What's the optimal prefetch distance?**

The optimal distance depends on:

- Memory latency
- Loop iteration time
- Cache size

A good rule of thumb: distance = memory latency / iteration time. On typical systems, prefetching 32-64 elements ahead works well.
## Exercise 4: Memory Alignment for SIMD

### Objective

Measure the impact of memory alignment on SIMD performance.

### Tasks
1. Create aligned and unaligned arrays:

   ```cpp
   // Unaligned (heap allocation carries no 64-byte guarantee)
   float* unaligned = new float[1024];

   // Aligned
   alignas(64) float aligned[1024];
   // or
   float* aligned = static_cast<float*>(
       std::aligned_alloc(64, 1024 * sizeof(float)));
   ```

2. Run SIMD operations on both.
3. Compare performance.
### Questions

**Q: When does alignment matter most?**

Alignment matters most for:

- SIMD load/store instructions that require alignment (e.g., `_mm_load_ps` vs `_mm_loadu_ps`)
- AVX-512, which has stricter alignment requirements
- Large data sets where cache efficiency is critical
## Challenge: Profile a Mystery Program

### Objective

Use profiling tools to identify and fix a performance bottleneck.

### Setup

A "mystery" program with a hidden bottleneck is provided in the exercises directory.

### Tasks
1. Profile the program:

   ```bash
   perf record -g ./mystery_program
   perf report
   ```

2. Generate a FlameGraph:

   ```bash
   perf script | stackcollapse-perf.pl | flamegraph.pl > mystery_flame.svg
   ```

3. Identify the bottleneck.
4. Fix it.
5. Measure the improvement.
### Hint

Look for:

- Wide sections in the FlameGraph (hot functions)
- High cache miss rates
- Functions taking unexpectedly long
## Solutions

See `solutions.md` for complete solutions to these exercises.

## Next Steps

- Continue to the SIMD Exercises
- Read the Memory Utilities API
- Explore the Optimization Decision Tree