🚀 Performance Optimization
Tuning strategies, benchmark testing, and best practices.
Table of Contents
- Performance Overview
- Kernel Selection Strategy
- Performance Tuning Guide
- Benchmarking
- Performance Optimization Best Practices
- Benchmark Data
- Troubleshooting
Performance Overview
The GPU SpMV library achieves high performance by intelligently selecting among multiple specialized kernels based on matrix characteristics.
Core Performance Metrics
| Metric | Description | Target |
|---|---|---|
| Bandwidth Utilization | Actual memory bandwidth / theoretical peak | > 60% |
| Compute Density | FLOPs per byte of memory traffic | Matrix-dependent |
| Scalability | Performance growth with matrix size | Linear scaling |
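As a concrete reading of these metrics, the sketch below estimates bandwidth, GFLOPS, and utilization for one CSR SpMV run from its timing. The byte-count model (4-byte values and indices, no cache reuse on the `x` gather) and the helper name `estimate_metrics` are illustrative assumptions, not part of the library API.

```cpp
// Hypothetical helper: estimate CSR SpMV memory traffic and derived metrics.
// Assumes 4-byte values and indices (float / int32); not the library's API.
struct SpmvMetrics {
    double gbps;         // achieved memory bandwidth
    double gflops;       // achieved compute throughput
    double utilization;  // fraction of theoretical peak bandwidth
};

SpmvMetrics estimate_metrics(long long num_rows, long long nnz,
                             double elapsed_ms, double peak_gbps) {
    // CSR SpMV traffic: values (4B*nnz), column indices (4B*nnz),
    // row pointer (4B*(rows+1)), x gather (~4B*nnz, ignoring cache reuse),
    // and the y write (4B*rows).
    double bytes = 4.0 * (nnz + nnz + (num_rows + 1) + nnz + num_rows);
    double secs = elapsed_ms / 1e3;
    SpmvMetrics m;
    m.gbps = bytes / secs / 1e9;
    m.gflops = 2.0 * nnz / secs / 1e9;  // one multiply + one add per nonzero
    m.utilization = m.gbps / peak_gbps;
    return m;
}
```

For example, a 1M x 1M matrix with 50M nonzeros finishing in 5 ms works out to roughly 20 GFLOPS under this model.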
Kernel Selection Strategy
1. Scalar CSR Kernel
Use Case: Very sparse matrices (avg_nnz < 4)
```cpp
// Auto-selection
SpMVConfig config = spmv_auto_config(csr);
// config.kernel_type == KernelType::SCALAR_CSR
```
Performance Characteristics:
- Each thread processes one row
- Minimizes inter-thread coordination overhead
- Suitable for cases with very few non-zero elements
Bandwidth Utilization: ~40-50%
2. Vector CSR Kernel
Use Case: Moderate sparsity matrices (skewness < 10)
Performance Characteristics:
- Each warp collaboratively processes one row
- Coalesced memory access pattern
- Balanced load distribution
Bandwidth Utilization: ~65-75%
3. Merge Path Kernel
Use Case: Highly skewed matrices (skewness ≥ 10)
Performance Characteristics:
- Perfect load balancing
- Binary search partition points
- Adaptive to matrix features
Bandwidth Utilization: ~70-80%
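The "binary search partition points" step can be sketched as follows. This is the standard merge-path search in the style of Merrill and Garland's merge-based SpMV; the names and exact formulation are illustrative, not the library's internal code.

```cpp
#include <algorithm>
#include <vector>

// A (row, nonzero) coordinate on the merge path.
struct Coord { int row; int nnz; };

// Find where the work slice starting at `diagonal` begins. The combined
// workload is (row boundaries + nonzeros); a binary search over row_ptr
// locates the coordinate where row-end offsets and consumed nonzeros meet.
Coord merge_path_search(int diagonal, const std::vector<int>& row_ptr, int nnz) {
    int rows = static_cast<int>(row_ptr.size()) - 1;
    int lo = std::max(0, diagonal - nnz);
    int hi = std::min(diagonal, rows);
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        // Row `mid` is fully consumed before this diagonal if its end offset
        // fits within the diagonal's remaining nonzero budget.
        if (row_ptr[mid + 1] <= diagonal - 1 - mid)
            lo = mid + 1;
        else
            hi = mid;
    }
    return Coord{lo, diagonal - lo};
}
```

Each thread block evaluates this search once at its starting diagonal, so every block receives exactly the same number of work items (row boundaries plus nonzeros) regardless of how skewed the rows are.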
4. ELL Kernel
Use Case: ELL format matrices
Performance Characteristics:
- Fully coalesced memory access
- Column-major storage
- Highest bandwidth utilization
Bandwidth Utilization: ~80-90%
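Putting the cases above together, the selection logic can be sketched as a simple heuristic over the CSR row pointer. The thresholds are the ones quoted on this page; the skewness definition (longest row divided by the average row length) is an assumption about what `spmv_auto_config` measures, not confirmed internals.

```cpp
#include <algorithm>
#include <vector>

enum class Kernel { ScalarCSR, VectorCSR, MergePath };

// Sketch of the documented selection heuristic; not the library's code.
Kernel select_kernel(const std::vector<int>& row_ptr) {
    int rows = static_cast<int>(row_ptr.size()) - 1;
    int nnz = row_ptr.back() - row_ptr.front();
    double avg_nnz = static_cast<double>(nnz) / rows;
    int max_row = 0;
    for (int r = 0; r < rows; ++r)
        max_row = std::max(max_row, row_ptr[r + 1] - row_ptr[r]);
    double skewness = max_row / avg_nnz;  // assumed definition

    if (avg_nnz < 4.0)    return Kernel::ScalarCSR;  // very sparse rows
    if (skewness < 10.0)  return Kernel::VectorCSR;  // moderate, uniform rows
    return Kernel::MergePath;                        // highly skewed rows
}
```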
Performance Tuning Guide
1. Auto Configuration (Recommended)
```cpp
// Let the library automatically select the optimal kernel
SpMVConfig config = spmv_auto_config(csr);
SpMVResult result = spmv_csr(csr, d_x, d_y, &config, n);
```
Advantages:
- No manual tuning required
- Intelligent selection based on matrix features
- Suitable for most scenarios
2. Manual Kernel Selection
```cpp
// Manually select a kernel for specific scenarios
SpMVConfig config;
config.kernel_type = KernelType::MERGE_PATH;
config.auto_select = false;

SpMVResult result = spmv_csr(csr, d_x, d_y, &config, n);
```
Use Cases:
- Matrix characteristics are known and stable
- Maximum performance is required
- Auto-selection does not pick the ideal kernel
3. Format Conversion
```cpp
// CSR -> ELL conversion
ELLMatrix* ell = ell_create(num_rows, num_cols, max_nnz_per_row);
ell_from_csr(ell, csr);
ell_to_gpu(ell);

// The ELL format usually performs better on uniform matrices
SpMVResult result = spmv_ell(ell, d_x, d_y, n);
```
When to Convert:
- Matrix row lengths are uniform
- Per-row non-zero counts vary by less than 20%
- Maximum performance is the goal
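A quick way to check these conversion criteria programmatically: the sketch below flags a matrix as ELL-friendly when its longest row is within 20% of the average row length. The helper name and the exact variation measure are illustrative assumptions, not library API.

```cpp
#include <algorithm>
#include <vector>

// Sketch: decide whether CSR -> ELL conversion is likely worthwhile.
// ELL pads every row to max_nnz_per_row, so highly non-uniform rows waste
// memory and bandwidth on padding; uniform rows pad almost nothing.
bool should_convert_to_ell(const std::vector<int>& row_ptr) {
    int rows = static_cast<int>(row_ptr.size()) - 1;
    int nnz = row_ptr.back() - row_ptr.front();
    double avg = static_cast<double>(nnz) / rows;
    int max_row = 0;
    for (int r = 0; r < rows; ++r)
        max_row = std::max(max_row, row_ptr[r + 1] - row_ptr[r]);
    // Variation of the longest row relative to the average row length.
    double variation = (max_row - avg) / avg;
    return variation < 0.20;
}
```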
Benchmarking
Running Benchmarks
```cpp
#include <spmv/benchmark.h>

BenchmarkConfig config;
config.iterations = 100;     // number of timed iterations
config.warmup = true;        // run warmup iterations first
config.print_details = true; // print detailed information

spmv_benchmark(csr, &config);
```
Example Output
```text
=== GPU SpMV Benchmark ===
Matrix: 10000 x 10000, nnz = 500000
Kernel: Vector CSR
Iterations: 100 (10 warmup)

Results:
  Avg: 2.34 ms
  Min: 2.12 ms
  Max: 2.89 ms
  Std: 0.15 ms

Bandwidth: 68.5 GB/s (70.2% of peak)
GFLOPS: 42.8
```
Custom Benchmark
```cpp
#include <spmv/spmv.h>
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

void custom_benchmark(const CSRMatrix* csr,
                      const float* d_x,
                      float* d_y,
                      int n,
                      int iterations) {
    // Warmup: first calls pay kernel-selection and cache-setup costs
    SpMVConfig config = spmv_auto_config(csr);
    for (int i = 0; i < 5; i++) {
        spmv_csr(csr, d_x, d_y, &config, n);
    }

    // Timed run
    cudaDeviceSynchronize();
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; i++) {
        spmv_csr(csr, d_x, d_y, &config, n);
    }
    cudaDeviceSynchronize();
    auto end = std::chrono::high_resolution_clock::now();

    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(
        end - start).count();
    printf("Avg: %.3f ms\n", duration / 1000.0 / iterations);
}
```
Performance Optimization Best Practices
1. Memory Optimization
```cpp
// ✅ Recommended: RAII buffers
void process() {
    CudaBuffer<float> d_x(1000000);
    CudaBuffer<float> d_y(1000000);
    // Memory is released automatically when the buffers go out of scope
}

// ❌ Avoid: manual management
void process() {
    float *d_x, *d_y;
    cudaMalloc(&d_x, 1000000 * sizeof(float));
    cudaMalloc(&d_y, 1000000 * sizeof(float));
    // Easy to forget cudaFree, or to leak on an early return
    cudaFree(d_x);
    cudaFree(d_y);
}
```
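For reference, the RAII pattern behind a buffer type like `CudaBuffer` can be sketched with host memory as follows; a real CUDA version would call `cudaMalloc`/`cudaFree` in the same places. Everything here, including the `ScopedBuffer` name, is illustrative rather than the library's actual implementation.

```cpp
#include <cstdlib>
#include <utility>

// Minimal RAII buffer sketch (host memory stands in for device memory).
template <typename T>
class ScopedBuffer {
public:
    explicit ScopedBuffer(size_t count)
        : ptr_(static_cast<T*>(std::malloc(count * sizeof(T)))),
          count_(count) {}
    ~ScopedBuffer() { std::free(ptr_); }  // runs on every exit path

    // Non-copyable: two owners would double-free.
    ScopedBuffer(const ScopedBuffer&) = delete;
    ScopedBuffer& operator=(const ScopedBuffer&) = delete;

    // Movable: ownership can be transferred, leaving the source empty.
    ScopedBuffer(ScopedBuffer&& other) noexcept
        : ptr_(std::exchange(other.ptr_, nullptr)),
          count_(std::exchange(other.count_, 0)) {}

    T* data() { return ptr_; }
    size_t size() const { return count_; }

private:
    T* ptr_;
    size_t count_;
};
```

Deleting the copy operations is what makes the pattern safe: ownership stays unique, so the destructor frees each allocation exactly once.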
2. Execution Context Reuse
```cpp
// ✅ Recommended: reuse the execution context
SpMVExecutionContext ctx;
for (int i = 0; i < 100; i++) {
    spmv_csr(csr, d_x, d_y, &config, n, &ctx);
    // Texture objects and cache configuration are reused
}

// ❌ Avoid: creating a fresh context on every call
for (int i = 0; i < 100; i++) {
    SpMVResult result = spmv_csr(csr, d_x, d_y, &config, n);
    // Texture objects are recreated on each call
}
```
3. Batch Processing
```cpp
// ✅ Recommended: batch-process multiple matrices
void process_batch(const std::vector<CSRMatrix*>& matrices) {
    for (auto* csr : matrices) {
        csr_to_gpu(csr);
        SpMVConfig config = spmv_auto_config(csr);
        spmv_csr(csr, d_x, d_y, &config, csr->num_rows);
    }
}

// ❌ Avoid: loading, processing, and destroying one matrix at a time
for (int i = 0; i < 100; i++) {
    CSRMatrix* csr = load_matrix(i);
    csr_to_gpu(csr);
    spmv_csr(csr, d_x, d_y, &config, n);
    csr_destroy(csr);
}
```
4. Data Transfer Optimization
```cpp
// ✅ Recommended: asynchronous transfer on a stream
cudaMemcpyAsync(d_x.data(), h_x.data(),
                n * sizeof(float),
                cudaMemcpyHostToDevice,
                stream);

// ❌ Avoid: synchronous transfer blocks the host
cudaMemcpy(d_x.data(), h_x.data(),
           n * sizeof(float),
           cudaMemcpyHostToDevice);
```

Note: `cudaMemcpyAsync` only overlaps with computation when the host buffer is pinned (allocated with `cudaMallocHost` or `cudaHostAlloc`); with pageable host memory the copy falls back to synchronous behavior.
Benchmark Data
NVIDIA RTX 3090 (Ampere) Test Results
| Matrix Size | Non-Zero Elements | Kernel | Time (ms) | Bandwidth (GB/s) | Utilization |
|---|---|---|---|---|---|
| 10K × 10K | 500K | Vector CSR | 2.34 | 68.5 | 70.2% |
| 50K × 50K | 2.5M | Merge Path | 11.8 | 71.2 | 72.9% |
| 100K × 100K | 5M | Merge Path | 23.5 | 69.8 | 71.5% |
| 500K × 500K | 25M | Merge Path | 118.3 | 70.5 | 72.2% |
| 1M × 1M | 50M | Merge Path | 235.7 | 69.1 | 70.8% |
Different GPU Architecture Comparison
| GPU Architecture | Representative Model | Theoretical Bandwidth | Actual Utilization |
|---|---|---|---|
| Volta | V100 | 900 GB/s | ~65% |
| Turing | RTX 2080 | 448 GB/s | ~68% |
| Ampere | RTX 3090 | 936 GB/s | ~70% |
| Ada Lovelace | RTX 4090 | 1008 GB/s | ~72% |
Troubleshooting
Common Reasons for Poor Performance
- Not using auto configuration

  ```cpp
  // ❌ Wrong: manually selected an inefficient kernel
  SpMVConfig config;
  config.kernel_type = KernelType::SCALAR_CSR;
  ```

- Data not transferred to GPU

  ```cpp
  // ❌ Wrong: csr_to_gpu(csr) was never called
  spmv_csr(csr, d_x, d_y, &config, n);  // executes against CPU-side data
  ```

- Matrix size too small

  ```cpp
  // Warning: a 100x100 matrix has a high kernel-launch overhead ratio
  CSRMatrix* csr = csr_create(100, 100, 500);
  ```

- Frequent memory allocation

  ```cpp
  // ❌ Wrong: allocating inside the loop
  for (int i = 0; i < 100; i++) {
      CudaBuffer<float> buf(1000);  // allocated and freed every iteration
  }
  ```
Performance Analysis Tools
```bash
# Profile with nvprof (legacy)
nvprof ./spmv_benchmark

# Profile with Nsight Systems
nsys profile ./spmv_benchmark

# Profile with Nsight Compute
ncu --kernel-name spmv ./spmv_benchmark
```
For complete performance data, see the benchmarks/ directory.