Memory Layout Deep Dive
This deep dive explores the critical relationship between data structure design and memory performance in modern C++ applications.
The Memory Hierarchy Reality
Modern CPUs have a complex memory hierarchy with dramatically different access latencies:
| Level | Size | Latency | Cycles |
|---|---|---|---|
| L1 Cache | 32-64 KB | ~1 ns | 3-4 |
| L2 Cache | 256-512 KB | ~4 ns | 10-12 |
| L3 Cache | 2-32 MB | ~12 ns | 40-50 |
| Main Memory | GBs | ~100 ns | 200-300 |
Key insight: A cache miss can cost 100x more than a cache hit. Memory layout optimization is about maximizing cache utilization.
AOS vs SOA: The Layout Battle
Array of Structures (AOS)
// Traditional approach - intuitive but often suboptimal
struct Particle {
float x, y, z; // Position (12 bytes)
float vx, vy, vz; // Velocity (12 bytes)
float mass; // Property (4 bytes)
}; // Total: 28 bytes per particle
std::vector<Particle> particles(10000);Problems with AOS:
- Poor cache utilization: When updating positions, we load velocity and mass into cache unnecessarily
- SIMD unfriendly: Data is interleaved, making vectorization difficult
- False sharing risk: Adjacent particles may share cache lines in multi-threaded scenarios
Structure of Arrays (SOA)
// Performance-oriented approach
struct ParticleSystem {
std::vector<float> x, y, z; // All x positions contiguous
std::vector<float> vx, vy, vz; // All velocities contiguous
std::vector<float> mass; // All masses contiguous
};
ParticleSystem particles;
particles.x.resize(10000);
particles.y.resize(10000);
// ... etc.Benefits of SOA:
- Perfect cache utilization: When processing positions, only position data enters cache
- SIMD friendly: Contiguous arrays enable natural vectorization
- Better prefetching: Hardware prefetcher can predict access patterns
Performance Comparison
// AOS update - scattered memory access
void updateAOS(std::vector<Particle>& p, float dt) {
for (auto& particle : p) {
particle.x += particle.vx * dt;
particle.y += particle.vy * dt;
particle.z += particle.vz * dt;
}
}
// SOA update - contiguous memory access, SIMD-friendly
void updateSOA(ParticleSystem& p, float dt) {
for (size_t i = 0; i < p.x.size(); ++i) {
p.x[i] += p.vx[i] * dt;
p.y[i] += p.vy[i] * dt;
p.z[i] += p.vz[i] * dt;
}
}Benchmark results (1M particles, AVX2 enabled):
| Layout | Time (ms) | Speedup | SIMD Utilization |
|---|---|---|---|
| AOS | 8.2 | 1.0x | ~15% |
| SOA | 2.1 | 3.9x | ~85% |
Cache Line Alignment
The 64-Byte Boundary
Modern x86 CPUs use 64-byte cache lines. Data crossing cache line boundaries requires two memory accesses.
#include <hpc/core.hpp>
// BAD: Unaligned structure
struct UnalignedCounter {
std::atomic<int64_t> counter1; // 8 bytes
std::atomic<int64_t> counter2; // 8 bytes
std::atomic<int64_t> counter3; // 8 bytes
std::atomic<int64_t> counter4; // 8 bytes
}; // Total: 32 bytes - all in one cache line!Problem: When thread A updates counter1 and thread B updates counter2, they cause false sharing - both threads fight for exclusive ownership of the same cache line.
Proper Alignment
// GOOD: Each counter on its own cache line
struct alignas(64) PaddedCounter {
std::atomic<int64_t> value;
// Implicit padding to 64 bytes
};
struct AlignedCounters {
PaddedCounter counter1; // 64 bytes (cache line 0)
PaddedCounter counter2; // 64 bytes (cache line 1)
PaddedCounter counter3; // 64 bytes (cache line 2)
PaddedCounter counter4; // 64 bytes (cache line 3)
};Memory Alignment API
The project provides utilities for alignment:
#include <hpc/core.hpp>
// Compile-time cache line size for alignas
struct alignas(hpc::CACHE_LINE_SIZE) MyData {
// ... fields ...
};
// Runtime cache line size detection
size_t line_size = hpc::cache_line_size();
// Aligned allocator for STL containers
std::vector<float, hpc::AlignedAllocator<float, 64>> aligned_data;False Sharing: The Silent Killer
Detection Pattern
False sharing symptoms:
- Multi-threaded scaling worse than expected
- High
perfmetrics for cache coherency traffic - Non-linear speedup degradation as thread count increases
Example: Parallel Counter
// BAD: False sharing
std::vector<int> counters(num_threads); // Contiguous memory!
#pragma omp parallel for
for (int i = 0; i < num_threads; ++i) {
for (int j = 0; j < 1000000; ++j) {
counters[i]++; // False sharing!
}
}
// GOOD: Cache-line padded
struct alignas(64) PaddedInt { int value; };
std::vector<PaddedInt> counters(num_threads);Performance Impact
| Threads | Bad (ms) | Good (ms) | Speedup Lost |
|---|---|---|---|
| 1 | 100 | 100 | 0% |
| 2 | 95 | 50 | 47% |
| 4 | 90 | 25 | 72% |
| 8 | 85 | 12 | 86% |
Memory Prefetching
Software Prefetch
#include <xmmintrin.h> // SSE intrinsics
void processWithPrefetch(const std::vector<float>& data) {
const size_t prefetch_distance = 8; // Prefetch 8 elements ahead
for (size_t i = 0; i < data.size(); ++i) {
// Prefetch future data
if (i + prefetch_distance < data.size()) {
_mm_prefetch(reinterpret_cast<const char*>(&data[i + prefetch_distance]),
_MM_HINT_T0); // Prefetch into L1
}
// Process current element
process(data[i]);
}
}Prefetch Guidelines
- Distance: Typically 8-32 elements ahead
- Timing: Too early = cache eviction, too late = no benefit
- Pattern: Works best for predictable access patterns
Practical Guidelines
When to Use SOA
✅ Use SOA when:
- Processing subsets of fields (e.g., only positions)
- SIMD vectorization is important
- Memory bandwidth is the bottleneck
- Data is accessed in predictable patterns
❌ Keep AOS when:
- All fields are always accessed together
- Object-oriented design is cleaner
- Random access patterns dominate
- Memory overhead of SOA is prohibitive
Alignment Rules
- Hot data: Align frequently accessed data to cache lines
- Thread-local data: Pad per-thread data to avoid false sharing
- Atomic variables: Each atomic on its own cache line if contended
- SIMD data: Align to 16 (SSE), 32 (AVX), or 64 (AVX-512) bytes