SIMD Vectorization

This module demonstrates SIMD (Single Instruction, Multiple Data) programming techniques.

Contents

File Topic Key Concept
src/auto_vectorize.cpp Auto-Vectorization Compiler-friendly patterns
src/intrinsics_intro.cpp SIMD Intrinsics Manual SSE/AVX/AVX-512
include/simd_wrapper.hpp SIMD Wrapper Readable abstractions

Key Concepts

Auto-Vectorization

Write code that compilers can automatically vectorize:

// Good: simple loop, no dependencies
void add_arrays(float* a, const float* b, const float* c, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        a[i] = b[i] + c[i];
    }
}

// Bad: loop-carried dependency
void prefix_sum(float* a, size_t n) {
    for (size_t i = 1; i < n; ++i) {
        a[i] += a[i-1];  // Depends on previous iteration
    }
}

Check vectorization with compiler flags:

# GCC
g++ -O3 -march=native -fopt-info-vec-optimized file.cpp

# Clang
clang++ -O3 -march=native -Rpass=loop-vectorize file.cpp

SIMD Intrinsics

Manual vectorization for maximum control:

#include <immintrin.h>

// SSE: 4 floats at once
void add_sse(float* a, const float* b, const float* c, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        __m128 vb = _mm_load_ps(&b[i]);
        __m128 vc = _mm_load_ps(&c[i]);
        __m128 va = _mm_add_ps(vb, vc);
        _mm_store_ps(&a[i], va);
    }
}

// AVX2: 8 floats at once
void add_avx2(float* a, const float* b, const float* c, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        __m256 vb = _mm256_load_ps(&b[i]);
        __m256 vc = _mm256_load_ps(&c[i]);
        __m256 va = _mm256_add_ps(vb, vc);
        _mm256_store_ps(&a[i], va);
    }
}

SIMD Wrapper

Use the provided wrapper for readable SIMD code:

#include "simd_wrapper.hpp"

using Vec = simd::Vec256<float>;

void add_wrapped(float* a, const float* b, const float* c, size_t n) {
    for (size_t i = 0; i < n; i += Vec::size()) {
        Vec vb = Vec::load(&b[i]);
        Vec vc = Vec::load(&c[i]);
        Vec va = vb + vc;
        va.store(&a[i]);
    }
}

Instruction Sets

ISA Register Width Floats/Op Doubles/Op
SSE 128-bit 4 2
AVX2 256-bit 8 4
AVX-512 512-bit 16 8

Running Benchmarks

cmake --preset=release
cmake --build build/release

./build/release/examples/04-simd-vectorization/bench/simd_bench

Expected Results

Implementation Relative Speed
Scalar 1x
Auto-vectorized 2-4x
SSE 3-4x
AVX2 6-8x
AVX-512 10-16x

Actual speedup depends on:

  • CPU architecture
  • Memory bandwidth
  • Data alignment
  • Operation complexity

CPU Feature Detection

Check your CPU capabilities:

# Linux
cat /proc/cpuinfo | grep flags

# Look for: sse, sse2, sse4_1, avx, avx2, avx512f

Further Reading

results matching ""

    No results matching ""