GPU Benchmark Results: Orbit Regime Comparison
Executive Summary
Benchmark results reveal dramatic performance differences between orbit regimes:
| Regime | Kernel Speedup (1000 sats) | GPU Throughput | Key Finding |
|---|---|---|---|
| LEO Only | 83.2x | 439M props/sec | GPU excels at near-earth propagation |
| GEO Only | 1.6x | 5.8M props/sec | Deep-space propagation barely benefits |
| Mixed (60/40) | 3.5x | 14.7M props/sec | Two-kernel optimization critical |
Full Benchmark Results
Test Configuration
- Time points: 10,080 (7 days at 1-minute intervals)
- Satellite counts: 10, 50, 100, 1000
- Orbit regimes: LEO only, GEO only, Mixed (60% LEO / 40% GEO)
LEO Only (SGP4 Near-Earth Propagation)
| Satellites | CPU Time | GPU Kernel | GPU + Transfer | Kernel Speedup | GPU Throughput |
|---|---|---|---|---|---|
| 10 | 19ms | 0.4ms | 1.1ms | 46.9x | 248M/sec |
| 50 | 96ms | 1.6ms | 12ms | 61.6x | 325M/sec |
| 100 | 191ms | 2.7ms | 23ms | 71.4x | 377M/sec |
| 1000 | 1,911ms | 23ms | 247ms | 83.2x | 439M/sec |
Key Observations - LEO
- Massive GPU advantage: 83x speedup for kernel-only at 1000 satellites
- Transfer bottleneck: transfer is 90%+ of end-to-end GPU time at 1000 satellites (224ms transfer vs 23ms computation), which cuts the end-to-end speedup to about 7.7x (1,911ms / 247ms)
- Scaling: Speedup increases with satellite count (46.9x → 83.2x)
- Throughput: GPU achieves 439 million propagations/sec
- CPU performance: Consistent ~5.3M props/sec regardless of batch size
Recommendation: LEO-only workloads are ideal for GPU acceleration. Use GPU-resident mode to eliminate transfer overhead.
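The throughput and speedup columns follow directly from the timings. The sketch below shows the arithmetic, assuming each satellite is propagated over all 10,080 time points; the function names are illustrative and not part of the benchmarked code. Because the table reports rounded millisecond values, recomputed figures match the table only to within rounding (for example 83.1x rather than 83.2x).

```rust
/// Time steps per satellite in this benchmark (7 days at 1-minute intervals).
const TIME_POINTS: u64 = 10_080;

/// Propagations per second for `satellites` propagated over all time points
/// in `elapsed_ms` milliseconds.
fn throughput_per_sec(satellites: u64, elapsed_ms: f64) -> f64 {
    (satellites * TIME_POINTS) as f64 / (elapsed_ms / 1000.0)
}

/// Speedup of the GPU over the CPU for the same workload.
/// Pass the kernel-only time for "Kernel Speedup", or the
/// GPU + Transfer time for the end-to-end figure.
fn speedup(cpu_ms: f64, gpu_ms: f64) -> f64 {
    cpu_ms / gpu_ms
}
```

With the 1000-satellite LEO row, `throughput_per_sec(1000, 23.0)` gives roughly 438M props/sec, `speedup(1911.0, 23.0)` gives roughly 83.1, and `speedup(1911.0, 247.0)` gives roughly 7.7 once transfer is included.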
GEO Only (SDP4 Deep-Space Propagation)
| Satellites | CPU Time | GPU Kernel | GPU + Transfer | Kernel Speedup | GPU Throughput |
|---|---|---|---|---|---|
| 10 | 28ms | 29ms | 29ms | 0.99x | 3.5M/sec |
| 50 | 143ms | 112ms | 122ms | 1.28x | 4.5M/sec |
| 100 | 286ms | 194ms | 215ms | 1.48x | 5.2M/sec |
| 1000 | 2,860ms | 1,739ms | 1,958ms | 1.64x | 5.8M/sec |
Key Observations - GEO
- Minimal GPU advantage: Only 1.64x speedup even at 1000 satellites
- Low transfer overhead: Only 11% (SDP4 is compute-heavy, transfer is relatively small)
- Limited scaling: Speedup barely improves with satellite count (0.99x → 1.64x)
- Throughput: GPU achieves only 5.8M props/sec (75x slower than LEO!)
- CPU performance: ~3.5M props/sec, about a third below the ~5.3M props/sec the CPU achieves on LEO
Why GEO is slow on GPU:
- Deep-space propagation (SDP4) has complex perturbation calculations
- More branching and conditional logic → warp divergence
- Lunar/solar perturbations require iterative solvers
- Higher computational intensity per satellite
Recommendation: For GEO-only workloads, consider CPU parallelism (rayon) instead of GPU unless batch size is very large (1000+).
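The CPU-parallel alternative can be sketched as follows. This is a minimal illustration, not the project's implementation: it uses `std::thread::scope` from the standard library as a dependency-free stand-in for rayon's `par_iter`, and `propagate_stub` is a hypothetical placeholder rather than a real SDP4 propagator. Satellites are independent, so they are split into contiguous chunks with one worker thread per chunk.

```rust
use std::thread;

/// Hypothetical placeholder for one propagation step; NOT a real SDP4 model.
fn propagate_stub(sat_id: u32, minute: u32) -> f64 {
    (sat_id as f64) * 1e-3 + (minute as f64).sin()
}

/// Propagates every satellite over `minutes` time steps, splitting the
/// satellites into contiguous chunks handled by up to `workers` threads.
/// Results come back in the same order as `sat_ids`.
fn propagate_parallel(sat_ids: &[u32], minutes: u32, workers: usize) -> Vec<Vec<f64>> {
    let workers = workers.max(1);
    // Ceiling division so that every satellite lands in some chunk.
    let chunk_len = ((sat_ids.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        let handles: Vec<_> = sat_ids
            .chunks(chunk_len)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .map(|&id| {
                            (0..minutes)
                                .map(|m| propagate_stub(id, m))
                                .collect::<Vec<f64>>()
                        })
                        .collect::<Vec<_>>()
                })
            })
            .collect();
        // Join in spawn order to preserve the input ordering.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

With rayon the same structure typically collapses to a single `par_iter().map(...).collect()` call, and work stealing balances uneven per-satellite cost better than fixed chunks do.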
Mixed (60% LEO, 40% GEO)
| Satellites | CPU Time | GPU Kernel | GPU + Transfer | Kernel Speedup | GPU Throughput |
|---|---|---|---|---|---|
| 10 | 24ms | 29ms | 29ms | 0.83x | 3.5M/sec |
| 50 | 119ms | 57ms | 67ms | 2.11x | 8.9M/sec |
| 100 | 238ms | 86ms | 107ms | 2.78x | 11.8M/sec |
| 1000 | 2,382ms | 685ms | 893ms | 3.48x | 14.7M/sec |
Key Observations - Mixed
- Moderate speedup: 3.48x at 1000 satellites
- Transfer overhead: 23% (moderate, 209ms transfer vs 685ms computation)
- Two-kernel benefit: Without partitioning by regime, LEO and GEO threads sharing a warp would diverge
- Throughput: 14.7M props/sec (between LEO and GEO as expected)
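- Crossover point: GPU overtakes CPU between 10 and 50 satellites (0.83x at 10, 2.11x at 50); the table has no finer-grained measurements, so ~30-40 satellites is a conservative estimate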
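Performance composition (1000 satellites, assuming the LEO and GEO kernels run back to back at their single-regime throughputs):
- LEO portion: 600 sats × 10,080 steps / 439M props/sec ≈ 14ms
- GEO portion: 400 sats × 10,080 steps / 5.8M props/sec ≈ 695ms
- Predicted total ≈ 709ms, or ≈ 14.2M props/sec, close to the measured 685ms and 14.7M props/sec
- GEO satellites account for roughly 98% of kernel time despite being only 40% of the constellation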
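The composition estimate generalizes to other LEO/GEO mixes. The sketch below is a simple model, not a measurement: it assumes the two kernels run sequentially and that each regime sustains the throughput measured in its single-regime benchmark. The function name and signature are illustrative.

```rust
/// Time steps per satellite in this benchmark (7 days at 1-minute intervals).
const TIME_POINTS: f64 = 10_080.0;

/// Predicted (leo_ms, geo_ms) kernel times for a mixed batch, given the
/// per-regime throughputs in propagations per second.
/// Assumes the LEO and GEO kernels run one after the other.
fn predicted_kernel_ms(
    leo_sats: f64,
    geo_sats: f64,
    leo_props_per_sec: f64,
    geo_props_per_sec: f64,
) -> (f64, f64) {
    let leo_ms = leo_sats * TIME_POINTS / leo_props_per_sec * 1000.0;
    let geo_ms = geo_sats * TIME_POINTS / geo_props_per_sec * 1000.0;
    (leo_ms, geo_ms)
}
```

Calling `predicted_kernel_ms(600.0, 400.0, 439e6, 5.8e6)` yields roughly 13.8ms for LEO and 695.2ms for GEO. Total time is the sum of the per-regime times, so the combined throughput is dominated by the slower regime and is not a simple weighted average of the two rates.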
Recommendation: Mixed workloads benefit significantly from two-kernel optimization. Without partitioning, LEO threads would be blocked waiting for GEO threads in the same warp.
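The partitioning step could look like the sketch below. It is an assumption about how such a split might be written, not the project's actual code: the `Sat` struct is a hypothetical minimal record, and the classification uses the standard SGP4 rule that orbits with a period of 225 minutes or more take the deep-space (SDP4) path.

```rust
/// Orbital period (minutes) at or above which SGP4 uses its deep-space branch.
const DEEP_SPACE_PERIOD_MIN: f64 = 225.0;

/// Hypothetical minimal satellite record; a real element set has more fields.
#[derive(Debug, Clone, PartialEq)]
struct Sat {
    id: u32,
    /// Mean motion in revolutions per day, as given in a TLE.
    mean_motion_rev_per_day: f64,
}

/// True when the orbital period puts the satellite in the deep-space regime.
fn is_deep_space(sat: &Sat) -> bool {
    let period_min = 1440.0 / sat.mean_motion_rev_per_day;
    period_min >= DEEP_SPACE_PERIOD_MIN
}

/// Splits a constellation into (near_earth, deep_space) batches so that each
/// batch can be launched as its own kernel and no warp mixes the two paths.
fn partition_by_regime(sats: Vec<Sat>) -> (Vec<Sat>, Vec<Sat>) {
    sats.into_iter().partition(|s| !is_deep_space(s))
}
```

Because `partition` preserves relative order within each batch, results can be scattered back to their original positions by carrying the `id` (or an index) alongside each record.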
Analysis: Why LEO is 75x Faster than GEO on GPU
SGP4 (Near-Earth / LEO)
- Simple perturbations (J2, J3, J4 gravitational harmonics)
- Minimal branching
- No iterative solvers beyond the short Kepler's-equation iteration
- ~50 double-precision operations per propagation
- Excellent GPU parallelism
SDP4 (Deep-Space / GEO)
- Complex perturbations (lunar, solar, resonance effects)
- Deep conditional branches for:
  - Synchronous vs non-synchronous orbits
  - Resonance detection and handling
  - Lyddane coordinate conversion
- Iterative Newton-Raphson solvers
- ~200+ double-precision operations per propagation
- Poor GPU parallelism due to divergence
Transfer Overhead Analysis
| Regime | Propagations (1000 sats) | Transfer Time | Transfer Share of GPU Time | Data Size |
|---|---|---|---|---|
| LEO | 10.08M props | 224ms | 90.7% | 560MB |
| GEO | 10.08M props | 219ms | 11.2% | 560MB |
| Mixed | 10.08M props | 209ms | 23.3% | 560MB |
Key insight: Transfer time is constant (~220ms for 560MB), but appears as higher overhead when kernel time is low (LEO).
PCIe bandwidth: 560MB / 220ms ≈ 2.5 GB/sec
- This is well below what PCIe 3.0 offers on an x16 link (~15.75 GB/sec theoretical, ~12 GB/sec typically achievable)
- Likely due to non-contiguous memory access patterns and cudarc overhead
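The overhead percentages and the bandwidth estimate can be reproduced from the table values. The helpers below are an illustrative sketch (the names are not from the benchmarked code) and treat 1 GB as 1000 MB, matching the arithmetic above.

```rust
/// Effective transfer bandwidth in GB/s, with 1 GB = 1000 MB.
fn effective_bandwidth_gb_per_sec(megabytes: f64, transfer_ms: f64) -> f64 {
    (megabytes / 1000.0) / (transfer_ms / 1000.0)
}

/// Fraction of end-to-end GPU time (kernel + transfer) spent on transfer.
fn transfer_share(kernel_ms: f64, total_ms: f64) -> f64 {
    (total_ms - kernel_ms) / total_ms
}
```

For the 1000-satellite rows, `transfer_share(23.0, 247.0)` is about 0.907 for LEO and `transfer_share(1739.0, 1958.0)` is about 0.112 for GEO, while `effective_bandwidth_gb_per_sec(560.0, 220.0)` is about 2.5. The absolute transfer time barely changes between regimes; only its share of the total does.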
Recommendations by Use Case
Use GPU When:
- LEO-heavy workloads (>50% LEO satellites)
- Large batches (100+ satellites)
- GPU-resident pipelines (collision detection, visualization, etc.)
- Need to free CPU for other tasks
Use CPU When:
- GEO-heavy workloads (>70% GEO satellites)
- Small batches (<30 satellites)
- Single propagations (one satellite, one time)
- No GPU available (graceful fallback)
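These rules of thumb can be encoded as a small dispatch function. The sketch below is one possible reading, not the project's actual selection logic: the `Backend` enum and `choose_backend` are hypothetical names, the thresholds are taken from the lists above, and mixes that fall between the two lists (50-70% GEO) default to the GPU here as an assumption.

```rust
/// Hypothetical backend selector result.
#[derive(Debug, PartialEq)]
enum Backend {
    Cpu,
    Gpu,
}

/// Picks a backend from batch size, GEO count, and GPU availability.
/// Thresholds follow the recommendations above: under 30 satellites or
/// more than 70% GEO goes to the CPU; everything else goes to the GPU.
fn choose_backend(total_sats: usize, geo_sats: usize, gpu_available: bool) -> Backend {
    // Graceful fallback, and small batches where launch cost dominates.
    if !gpu_available || total_sats < 30 {
        return Backend::Cpu;
    }
    let geo_fraction = geo_sats as f64 / total_sats as f64;
    if geo_fraction > 0.70 {
        Backend::Cpu
    } else {
        Backend::Gpu
    }
}
```

For example, the benchmarked mixed constellation, `choose_backend(1000, 400, true)`, selects the GPU, while an 80% GEO batch of the same size falls back to the CPU. The thresholds are worth re-tuning against measurements on the target hardware.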
Optimization Opportunities
| Optimization | Target Regime | Potential Gain |
|---|---|---|
| GPU-resident mode | LEO | Eliminate the ~90% transfer overhead |
| Optimize SDP4 branch reduction | GEO | 1.5-2x improvement possible |
| Half-precision for LEO | LEO | 2x throughput (if accuracy acceptable) |
| CPU parallelism (rayon) | GEO | 4-8x with multi-core CPU |
| Hybrid CPU+GPU | Mixed | Process LEO on GPU, GEO on CPU |
Conclusion
The benchmark reveals that:
- LEO propagation is the GPU's sweet spot, with an 83x kernel speedup
- GEO propagation barely benefits from GPU parallelism (1.6x)
- Two-kernel optimization is critical for mixed workloads
- Transfer overhead dominates for fast operations (LEO)
The implementation successfully addresses mixed-constellation performance through intelligent partitioning, achieving 3.5x speedup despite the GEO bottleneck. For LEO-only workloads, the GPU provides transformative performance (83x), while GEO-only workloads may be better served by CPU parallelism.