SIMD Exercises

Practice auto-vectorization, intrinsics, and SIMD wrapper usage.


Exercise 1: Enable Auto-Vectorization

Objective

Write code that the compiler can automatically vectorize.

Setup

cpp
// File: auto_vectorize_exercise.cpp
#include <vector>
#include <cstddef>

// TODO: Make this vectorizable
void add_arrays(float* a, const float* b, const float* c, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        a[i] = b[i] + c[i];
    }
}

Tasks

  1. Compile with a vectorization report (-c skips linking, since the file has no main yet)

    bash
    g++ -O3 -march=native -fopt-info-vec-optimized -c auto_vectorize_exercise.cpp
  2. Check whether the loop was vectorized (the missed-optimization report after this list shows why a loop was skipped)

  3. If not, modify the code to enable vectorization
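
If the optimized report is silent, GCC's companion flag lists the loops it skipped and the reason:

bash
g++ -O3 -march=native -fopt-info-vec-missed -c auto_vectorize_exercise.cpp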

Questions

Q: What prevents auto-vectorization?

Common blockers:

  • Pointer aliasing (compiler can't prove a, b, c don't overlap)
  • Loop-carried dependencies (see the example after this list)
  • Unknown trip count
  • Complex control flow
  • Function calls inside loop
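
For example, a prefix sum carries a dependency across iterations: each element needs the value the previous iteration just wrote, so the loop cannot be vectorized as written.

cpp
// Not auto-vectorizable as written: iteration i reads the result
// that iteration i-1 just stored.
void prefix_sum(float* a, size_t n) {
    for (size_t i = 1; i < n; ++i) {
        a[i] += a[i - 1];
    }
}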

Hint


Add __restrict to tell the compiler the pointers don't overlap:

cpp
void add_arrays(float* __restrict a, const float* __restrict b, 
                const float* __restrict c, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        a[i] = b[i] + c[i];
    }
}
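
Note that __restrict is a promise, not a request: if the arrays actually overlap, the program has undefined behavior. Only add it when you can guarantee the buffers are distinct.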

Exercise 2: Convert Scalar to SIMD

Objective

Convert a scalar function to use AVX2 intrinsics.

Setup

cpp
// Scalar version
void scale_array_scalar(float* arr, float scalar, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        arr[i] *= scalar;
    }
}

// TODO: Implement AVX2 version
void scale_array_avx2(float* arr, float scalar, size_t n) {
    // Your code here
}

Tasks

  1. Implement the AVX2 version using _mm256_* intrinsics

  2. Handle the remainder (elements not divisible by 8)

  3. Benchmark both versions (see the timing sketch after this list)

    bash
    g++ -O3 -march=native -mavx2 scale_exercise.cpp -o scale_exercise
    ./scale_exercise
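
A minimal timing sketch for task 3, assuming both scale functions above live in scale_exercise.cpp. For stable numbers, run each function several times and keep the best result.

cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Times one pass of each version over a 16M-element buffer.
int main() {
    const size_t n = 1 << 24;
    std::vector<float> data(n, 1.0f);

    auto time_ms = [&](auto fn) {
        auto start = std::chrono::steady_clock::now();
        fn(data.data(), 1.0001f, n);
        auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(stop - start).count();
    };

    std::printf("scalar: %6.2f ms\n", time_ms(scale_array_scalar));
    std::printf("avx2:   %6.2f ms\n", time_ms(scale_array_avx2));
}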

Reference

cpp
#include <immintrin.h>

// Load 8 floats (unaligned)
__m256 _mm256_loadu_ps(const float* addr);

// Store 8 floats (unaligned)
void _mm256_storeu_ps(float* addr, __m256 a);

// Broadcast scalar to all lanes
__m256 _mm256_set1_ps(float a);

// Multiply
__m256 _mm256_mul_ps(__m256 a, __m256 b);

Solution Skeleton

cpp
void scale_array_avx2(float* arr, float scalar, size_t n) {
    __m256 vscalar = _mm256_set1_ps(scalar);
    
    size_t i = 0;
    // Process 8 floats at a time
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(&arr[i]);
        v = _mm256_mul_ps(v, vscalar);
        _mm256_storeu_ps(&arr[i], v);
    }
    
    // Handle remainder
    for (; i < n; ++i) {
        arr[i] *= scalar;
    }
}
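
The skeleton uses the unaligned variants (loadu/storeu), which are valid for any address; on recent x86 cores they cost little or nothing extra when the data happens to be 32-byte aligned.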

Exercise 3: SIMD Wrapper Usage

Objective

Use the provided SIMD wrapper for cleaner code.

Tasks

  1. Rewrite Exercise 2 using hpc::simd::FloatVec

  2. Compare readability

  3. Benchmark both implementations

Reference

cpp
#include "simd_wrapper.hpp"

using namespace hpc::simd;

FloatVec v(&arr[i]);            // Load
FloatVec vscalar(scalar);       // Broadcast
FloatVec result = v * vscalar;  // Multiply
result.store(&arr[i]);          // Store
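
A sketch of task 1 built from only the operations shown above. It assumes FloatVec wraps 8 AVX2 lanes; if the wrapper exposes a width constant, use that instead of the literal 8.

cpp
#include "simd_wrapper.hpp"
#include <cstddef>

// Sketch: assumes hpc::simd::FloatVec holds 8 floats (AVX2).
void scale_array_wrapper(float* arr, float scalar, size_t n) {
    using hpc::simd::FloatVec;
    FloatVec vscalar(scalar);            // broadcast once, outside the loop
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        FloatVec v(&arr[i]);             // load 8 floats
        FloatVec result = v * vscalar;   // lane-wise multiply
        result.store(&arr[i]);           // store back
    }
    for (; i < n; ++i) {                 // scalar remainder
        arr[i] *= scalar;
    }
}

The body reads almost like the scalar loop, which is the readability win task 2 asks you to assess.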

Exercise 4: Horizontal Operations

Objective

Implement a dot product using SIMD.

Setup

cpp
// Scalar version
float dot_product_scalar(const float* a, const float* b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// TODO: Implement SIMD version
float dot_product_simd(const float* a, const float* b, size_t n) {
    // Your code here
    return 0.0f;  // placeholder so the skeleton compiles
}

Hint


Use FMA (fused multiply-add) for efficiency:

  1. Accumulate products in a SIMD vector
  2. Use horizontal_sum() at the end
  3. Handle remainder elements separately
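
Putting those three steps together with raw AVX2 intrinsics, as a sketch: _mm256_fmadd_ps requires FMA support (compile with -mfma or -march=native); without it, substitute a multiply followed by an add.

cpp
#include <immintrin.h>
#include <cstddef>

float dot_product_simd(const float* a, const float* b, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        acc = _mm256_fmadd_ps(va, vb, acc);   // acc += va * vb, fused
    }
    // Horizontal sum of the 8 accumulator lanes
    __m128 lo   = _mm256_castps256_ps128(acc);
    __m128 hi   = _mm256_extractf128_ps(acc, 1);
    __m128 sum4 = _mm_add_ps(lo, hi);
    __m128 sum2 = _mm_add_ps(sum4, _mm_movehl_ps(sum4, sum4));
    __m128 sum1 = _mm_add_ss(sum2, _mm_shuffle_ps(sum2, sum2, 1));
    float sum   = _mm_cvtss_f32(sum1);
    // Remainder elements
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}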

Exercise 5: Comparison Operations

Objective

Implement clamping using SIMD min/max operations.

Tasks

  1. Implement clamp_array that limits values to [min_val, max_val]

  2. Use SIMD min/max intrinsics (a sketch follows the reference below)

  3. Benchmark against scalar

Reference

cpp
__m256 _mm256_min_ps(__m256 a, __m256 b);
__m256 _mm256_max_ps(__m256 a, __m256 b);
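
Clamping is a max against the lower bound followed by a min against the upper bound; a sketch for task 1:

cpp
#include <immintrin.h>
#include <cstddef>

// Sketch: limit every element to [min_val, max_val].
void clamp_array(float* arr, float min_val, float max_val, size_t n) {
    __m256 vmin = _mm256_set1_ps(min_val);
    __m256 vmax = _mm256_set1_ps(max_val);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(&arr[i]);
        v = _mm256_max_ps(v, vmin);   // raise values below min_val
        v = _mm256_min_ps(v, vmax);   // cap values above max_val
        _mm256_storeu_ps(&arr[i], v);
    }
    for (; i < n; ++i) {              // scalar remainder
        float v = arr[i] < min_val ? min_val : arr[i];
        arr[i] = v > max_val ? max_val : v;
    }
}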

Challenge: Optimize a Real Algorithm

Objective

Optimize a Mandelbrot set calculation using SIMD.

Setup

The Mandelbrot calculation is embarrassingly parallel: each pixel is independent of the others.

cpp
int mandelbrot(float cx, float cy, int max_iter) {
    float x = 0, y = 0;
    int iter = 0;
    while (x*x + y*y <= 4.0f && iter < max_iter) {
        float temp = x*x - y*y + cx;
        y = 2*x*y + cy;
        x = temp;
        iter++;
    }
    return iter;
}

Tasks

  1. Vectorize to process 8 pixels simultaneously

  2. Handle the control flow (early exit on divergence)

  3. Measure speedup

Hint


Since SIMD doesn't support early exit per lane, use a mask to track which pixels have diverged:

cpp
__m256 cx_vec, cy_vec;  // 8 c values
__m256 x_vec = _mm256_setzero_ps();
__m256 y_vec = _mm256_setzero_ps();
__m256 iter_vec = _mm256_setzero_ps();
__m256 active_mask = _mm256_castsi256_ps(_mm256_set1_epi32(-1));

for (int iter = 0; iter < max_iter; ++iter) {
    // Only process active pixels
    // ...
}
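
One possible shape for that loop body, as a sketch: update the mask from the x*x + y*y <= 4 test, stop once every lane has diverged, and increment iter_vec only in lanes that are still active.

cpp
for (int iter = 0; iter < max_iter; ++iter) {
    __m256 x2   = _mm256_mul_ps(x_vec, x_vec);
    __m256 y2   = _mm256_mul_ps(y_vec, y_vec);
    __m256 mag2 = _mm256_add_ps(x2, y2);

    // A lane stays active while x*x + y*y <= 4
    __m256 still_in = _mm256_cmp_ps(mag2, _mm256_set1_ps(4.0f), _CMP_LE_OQ);
    active_mask = _mm256_and_ps(active_mask, still_in);
    if (_mm256_movemask_ps(active_mask) == 0) break;  // every lane diverged

    // iter_vec += 1.0f, but only in active lanes
    iter_vec = _mm256_add_ps(iter_vec,
                             _mm256_and_ps(active_mask, _mm256_set1_ps(1.0f)));

    // z = z*z + c in every lane; diverged lanes keep computing,
    // but their iteration counts are frozen by the mask above
    __m256 xy = _mm256_mul_ps(x_vec, y_vec);
    x_vec = _mm256_add_ps(_mm256_sub_ps(x2, y2), cx_vec);
    y_vec = _mm256_add_ps(_mm256_add_ps(xy, xy), cy_vec);
}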

Solutions

See solutions.md for complete solutions.

