Sorting Algorithms
Sorting is the canonical algorithmic problem. In high-performance C++, the interesting question is rarely "which sort is theoretically optimal" and usually "which sort is optimal for this data shape on this microarchitecture."
Overview
| Algorithm | Best Time | Average Time | Worst Time | Space | Stable | Notes |
|---|---|---|---|---|---|---|
| Quicksort (Hoare) | O(n log n) | O(n log n) | O(n²) | O(log n) | No | Cache-friendly in practice; pivot choice matters |
| Introsort | O(n log n) | O(n log n) | O(n log n) | O(log n) | No | Quicksort + heapsort fallback; std::sort uses this |
| Merge sort | O(n log n) | O(n log n) | O(n log n) | O(n) | Yes | Predictable, cache-friendly; good for linked data |
| Timsort | O(n) | O(n log n) | O(n log n) | O(n) | Yes | Exploits runs in real-world data |
| Radix sort (LSD) | O(nk) | O(nk) | O(nk) | O(n + r) | Yes | k = key width, r = radix; dominates for fixed-width keys |
| Bitonic sort | O(log² n) | O(log² n) | O(log² n) | O(1) | No | Data-parallel; excellent on SIMD/GPU |
Quicksort and Its Engineering
Why quicksort dominates in practice
Despite the O(n²) worst case, quicksort's average-case cache behavior and low instruction count make it the default choice for general-purpose sorting in C++ (std::sort is typically an introsort variant).
Key engineering decisions:
- Pivot selection. Median-of-three or median-of-medians reduces pathological inputs. Some implementations sample more pivots to avoid adversarial cases.
- Partition scheme. Hoare's partition visits each element once; Lomuto's is simpler but performs more swaps. Modern implementations favor Hoare or a hybrid.
- Small-array fallback. When subarrays drop below a threshold (typically 16–32 elements), insertion sort is faster due to lower overhead.
- Tail-call optimization. Recursing on the smaller partition first and looping on the larger keeps stack depth at O(log n).
Cache behavior
Quicksort is not inherently cache-optimal, but its divide-and-conquer structure tends to keep working sets within cache for moderate input sizes. The critical metric is the number of cache misses per element:
// A cache-conscious quicksort sketch
void quicksort(int* first, int* last) {
while (last - first > 32) {
auto* pivot = hoare_partition(first, last);
if (pivot - first < last - pivot) {
quicksort(first, pivot);
first = pivot + 1;
} else {
quicksort(pivot + 1, last);
last = pivot;
}
}
insertion_sort(first, last);
}Merge Sort and Stability
Merge sort guarantees O(n log n) and stability, at the cost of O(n) auxiliary memory. In C++, std::stable_sort uses a merge-sort variant.
Memory layout optimization
The naive merge sort allocates a temporary buffer for every merge. A better approach:
- Allocate one temporary array of size
nat the top level. - Alternate merging direction (A→B, then B→A) to avoid copies.
- This reduces allocation overhead from O(n log n) to O(n).
Radix Sort: When Big-O Loses
For fixed-width integer keys, radix sort can achieve linear time and outperform comparison sorts by 3–10x. The catch: it is not comparison-based and requires extra memory.
LSD (Least Significant Digit)
// Simplified LSD radix sort for 32-bit unsigned integers
void lsd_radix_sort(uint32_t* data, size_t n) {
constexpr int bits_per_pass = 8;
constexpr int buckets = 1 << bits_per_pass;
constexpr int mask = buckets - 1;
std::vector<uint32_t> temp(n);
for (int shift = 0; shift < 32; shift += bits_per_pass) {
size_t count[buckets] = {};
for (size_t i = 0; i < n; ++i) {
++count[(data[i] >> shift) & mask];
}
size_t prefix = 0;
for (int b = 0; b < buckets; ++b) {
size_t c = count[b];
count[b] = prefix;
prefix += c;
}
for (size_t i = 0; i < n; ++i) {
uint32_t key = data[i];
size_t bucket = (key >> shift) & mask;
temp[count[bucket]++] = key;
}
std::swap(data, temp);
}
}Performance characteristics:
- 4 passes for 32-bit keys (8 bits per pass).
- Each pass is a linear scan: cache-friendly if
countfits in L1. - Not stable unless implemented with careful offset tracking (the version above is stable).
SIMD opportunities
Some radix sort variants use SIMD to accelerate the histogram (counting) phase. AVX2 can process 8 integers at once, reducing the serial dependency of the counting loop. See the repository's examples/04-simd-vectorization/ for vectorization patterns applicable to histogram computation.
Bitonic Sort for Parallel Hardware
Bitonic sort is a data-parallel algorithm with O(log² n) depth, making it ideal for SIMD and GPU execution. It is not comparison-efficient for sequential CPUs but shines when wide vector units are available.
The key operation is the bitonic merge: compare-and-swap elements across a stride, then halve the stride. On AVX-512, a bitonic network of 16 elements fits in one register and can be sorted with a small sequence of permute-and-compare instructions.
Practical Selection Guide
Benchmark Notes
The repository's benchmark workflow follows this protocol:
- Generate input distributions: random, sorted, reverse-sorted, few-unique, adversarial (median-of-medians killer).
- Warm up the CPU and pin threads.
- Report cycles/element and LLC misses/element from
perf stat -e cycles,cache-misses. - Compare against
std::sort,std::stable_sort, andabsl::c_sortwhere available.
References
References
- (2011). Algorithms, 4th Edition. Addison-WesleyLink
- (1997). Introspective Sorting and Selection Algorithms. Software: Practice and Experience, 27(8)Link
- (2019). Decoding billions of integers per second through vectorization. Software: Practice and ExperienceLink
- (1999). A Killer Adversary for Quicksort. Software: Practice and Experience, 29(4)Link