GPU Performance Analysis and Optimization
This document details the CUDA GPU implementation for SGP4/SDP4 satellite propagation, including benchmark results, optimization techniques applied, and analysis of performance characteristics.
Executive Summary
| Metric | Value |
|---|---|
| Peak GPU Throughput | 14.6M propagations/sec |
| Kernel-only Speedup vs Sequential CPU | 3.4x |
| Speedup with GPU-to-CPU Transfer | 2.6-2.8x |
| Transfer Overhead | ~23% of total time |
The GPU implementation provides meaningful acceleration for batch propagation workloads, with the primary value being: - Freeing CPU for other tasks during propagation - GPU-resident mode for chained GPU operations (collision detection, etc.) - Consistent throughput that doesn't degrade with satellite count
Benchmark Results
Test Configuration
- Satellite mix: 60% LEO (SGP4), 40% GEO (SDP4)
- Time points: 10,080 (7 days at 1-minute intervals)
- Test satellites: Mix of ISS, Starlink, GPS, TDRS, and other operational satellites
Performance by Satellite Count
| Satellites | Total Props | CPU Time | GPU Kernel | GPU + Transfer | Kernel Speedup | Transfer Overhead |
|---|---|---|---|---|---|---|
| 10 | 100,800 | 24ms | 28ms | 29ms | 0.86x | 2.5% |
| 20 | 201,600 | 48ms | 28ms | 30ms | 1.69x | 4.5% |
| 40 | 403,200 | 96ms | 29ms | 37ms | 3.33x | 21.8% |
| 80 | 806,400 | 191ms | 57ms | 74ms | 3.36x | 23.0% |
| 160 | 1,612,800 | 378ms | 110ms | 136ms | 3.43x | 18.8% |
| 1000 | 10,080,000 | 2,362ms | 687ms | 894ms | 3.44x | 23.1% |
Throughput Comparison
| Implementation | Throughput (props/sec) |
|---|---|
| CPU Sequential | ~4.2M |
| GPU Kernel Only | ~14.6M |
| GPU + Transfer | ~11.3M |
Key Observations
- Crossover point: GPU becomes faster than CPU at ~30-40 satellites
- Throughput plateau: GPU throughput stabilizes at ~14.6M props/sec regardless of satellite count
- Transfer overhead: Consistently ~23% for large batches (560MB for 1000 sats × 10k times)
Optimizations Implemented
1. Two-Kernel Launch (Warp Divergence Elimination)
Problem: Mixed LEO/GEO constellations caused severe warp divergence. Threads in the same warp would execute different code paths (SGP4 for near-earth vs SDP4 for deep-space), serializing execution.
Solution: Partition satellites by orbital period at initialization: - SGP4 partition: Mean motion > 6.4 rev/day (period < 225 min) - SDP4 partition: Mean motion ≤ 6.4 rev/day (period ≥ 225 min)
Launch separate kernels for each partition, eliminating cross-partition warp divergence.
Impact: ~1.9x improvement (from 1.8x to 3.4x speedup)
2. GPU-Side Scatter (Indexed Kernel)
Problem: After two-kernel partitioning, results needed to be scattered back to original satellite order. CPU-side scatter was inefficient.
Solution: New sgp4_propagate_soa_indexed_kernel that accepts an index mapping array. Each partition kernel writes directly to the correct position in the shared output buffer.
// Write to indexed position in shared output buffer
int original_sat_idx = original_indices[sat_idx];
int out_idx = time_idx * n_total_sats + original_sat_idx;
state_x[out_idx] = state.x;
Impact: ~20% improvement in total throughput
3. GPU-Resident API
Problem: For GPU pipeline use cases (e.g., propagate then run collision detection), downloading results to CPU and re-uploading is wasteful.
Solution: propagate_soa_gpu_resident() returns CudaSlice<f64> buffers that remain on GPU memory.
// GPU-resident mode - no transfer
let gpu_results = propagator.propagate_soa_gpu_resident(×)?;
// Use directly in next GPU operation
collision_kernel.launch(cfg, (&gpu_results.x, &gpu_results.y, ...))?;
Impact: Eliminates 23% transfer overhead for chained GPU operations
4. SoA Memory Layout
Problem: Array-of-Structures (AoS) layout causes non-coalesced memory access on GPU.
Solution: Structure-of-Arrays (SoA) layout with time-major ordering:
buffer[time_idx * n_sats + sat_idx]
Adjacent threads access adjacent memory addresses, enabling coalesced memory transactions.
5. Shared Memory Time Caching
Problem: Each thread loading its own time value from global memory is inefficient.
Solution: Cooperative loading of time values into shared memory:
__shared__ double shared_times[256];
// Threads cooperatively load times
for (int i = thread_id; i < times_to_load; i += block_size) {
shared_times[i] = jd_times[time_block_start + i];
}
__syncthreads();
// All threads read from fast shared memory
double jd = shared_times[local_time_idx];
Performance Limiters
1. Register Pressure (Primary Bottleneck)
The SGP4/SDP4 algorithm requires many intermediate variables:
Orbital elements: inclination, raan, eccentricity, arg_perigee, mean_anomaly, mean_motion
Perturbation terms: xmdf, argpdf, nodedf, argpm, nodem, mm, nm, em, inclm
Position/velocity: x, y, z, vx, vy, vz, r, rdot, rvdot
Trigonometric: sin/cos of various angles (10+ pairs)
Intermediate: t2, t3, t4, tempa, tempe, templ, am, xlm, etc.
Estimated register usage: 100-150 registers per thread (800-1200 bytes)
Impact on occupancy: With 64KB registers per SM, only 2-3 warps can execute concurrently instead of the optimal 8-12 warps. This limits occupancy to ~25-33%.
2. Computational Intensity
SGP4 is compute-heavy with:
- 10+ SINCOS operations (~20 cycles each)
- 4+ fmod operations (~20-30 cycles each)
- Newton-Raphson iteration (up to 10 iterations with 2 SINCOS each)
- Multiple square roots and divisions
The kernel is compute-bound, not memory-bound.
3. Algorithm Structure
Even with two-kernel partitioning, internal branches remain: - Lyddane vs standard coordinate conversion - Eccentricity-dependent calculations - Error condition checks
Potential Future Optimizations
| Optimization | Potential Gain | Implementation Difficulty | Notes |
|---|---|---|---|
| Register spilling to local memory | 1.3-1.5x | High | Trade register pressure for memory latency |
| Precomputed trig lookup tables | 1.1-1.2x | Medium | Accuracy tradeoff |
| Half-precision intermediates | 1.5-2x | Medium | Significant accuracy impact |
| Fused multiply-add (FMA) | 1.05-1.1x | Low | Minor gains |
| Batch time upload caching | 1.05x | Low | Already partially implemented |
Realistic ceiling with all optimizations: ~5-6x speedup over sequential CPU
API Usage
Standard API (with GPU-to-CPU transfer)
use keplemon::gpu::{CudaSgp4Propagator, TleDataGpu};
let mut propagator = CudaSgp4Propagator::new()?;
propagator.init_satellites(&tle_data)?;
// Returns Vec<f64> arrays on CPU
let results = propagator.propagate_soa_arrays(&julian_dates)?;
println!("Position: ({}, {}, {})", results.x[0], results.y[0], results.z[0]);
GPU-Resident API (for GPU pipelines)
// Returns CudaSlice<f64> buffers on GPU
let gpu_results = propagator.propagate_soa_gpu_resident(&julian_dates)?;
// Use directly with other GPU operations
let device = propagator.cuda_device();
// custom_kernel.launch(cfg, (&gpu_results.x, &gpu_results.y))?;
// Convert to CPU only when needed
let cpu_results = gpu_results.to_soa_arrays(device)?;
Comparison with CPU Parallelism
For comparison, enabling Rayon parallel iteration on CPU typically achieves: - 4-core CPU: ~3-4x speedup over sequential - 8-core CPU: ~6-8x speedup over sequential
The GPU's 3.4x kernel speedup is comparable to a 4-core CPU, but: 1. GPU frees CPU cores for other work 2. GPU throughput is consistent regardless of satellite count 3. GPU-resident mode enables zero-copy pipelines
Recommendations
When to Use GPU
- Large batches: 100+ satellites with many time points
- GPU pipelines: When results feed into other GPU computations
- CPU offloading: When CPU is needed for other tasks
- Consistent latency: When predictable performance matters
When to Use CPU
- Small batches: <30 satellites
- Single propagations: One satellite, one time
- No GPU available: Graceful fallback
- Memory constrained: GPU memory is limited
Benchmark Reproduction
Run the benchmark with:
cargo test --features cuda test_benchmark_cpu_vs_gpu --release -- --nocapture
For quick validation:
cargo test --features cuda test_quick_benchmark --release -- --nocapture
Hardware Tested
Results in this document were obtained on consumer NVIDIA GPUs. Performance may vary on: - Data center GPUs (A100, H100): Expect 1.5-2x better due to more SMs - Older GPUs (Pascal, Volta): Expect 0.7-0.9x due to fewer SMs - Integrated GPUs: Not recommended for this workload