I have a ray-tracing CUDA kernel and I'm trying to understand its performance through Nsight Visual Studio Edition profiling.
My goal is to understand why I'm only reaching about 320 GFLOP/s whereas my GPU can theoretically reach 10.1 TFLOP/s (GTX 1080 at 1987 MHz).
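For reference, the 10.1 TFLOP/s figure is just the usual peak single-precision formula (2560 CUDA cores, 2 FLOPs per FMA, the boost clock I observe while profiling):

```python
# Peak FP32 throughput of a GTX 1080 at the clock I observe.
cuda_cores = 2560        # 20 SMs x 128 cores each
clock_hz = 1987e6        # boost clock seen during profiling
flops_per_fma = 2        # one FMA counts as multiply + add

peak_flops = cuda_cores * flops_per_fma * clock_hz
print(peak_flops / 1e12)  # ~10.17 TFLOP/s
```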
(1) Based on what I understand from NVIDIA documentation:
a) My issue efficiency is low (40%), which means the warp schedulers have no instruction to issue for 60% of the cycles because all warps are stalled. The stalls are mostly memory-related (my kernel appears to be latency bound, which is to be expected in ray tracing).
b) It seems to me that each warp scheduler has to issue one FMA per cycle (counted as 2 FLOPs) to reach the theoretical FLOPs, since the cores have a throughput of 1 FMA/cycle. So I can already see how low issue efficiency makes theoretical FLOPs unreachable.
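To sanity-check that claim, here is the per-scheduler breakdown (the 20 SMs, 4 schedulers per SM, and 32 FP32 lanes per scheduler are my reading of the GP104 layout, so treat them as an assumption):

```python
# Does "one FMA warp-instruction per scheduler per cycle" reproduce peak FLOPs?
sms = 20                 # assumed GP104 SM count
schedulers_per_sm = 4    # assumed warp schedulers per SM
warp_size = 32           # one FMA warp-instruction feeds 32 lanes
clock_hz = 1987e6

# Every scheduler issuing exactly one FMA warp-instruction each cycle:
flops = sms * schedulers_per_sm * warp_size * 2 * clock_hz
print(flops / 1e12)      # ~10.17 TFLOP/s, i.e. exactly the peak figure
```

So peak FLOPs really does require an FMA from every scheduler on every cycle, if those assumed numbers are right.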
(2) a) Pipe utilization shows 43-47% arithmetic utilization. Shouldn't that be lower than issue efficiency? Or does it have to be multiplied by issue efficiency to get an actual throughput estimate?
b) The arithmetic workload shows only 38% FP32 operations (the rest is mainly cmp/min/max plus a couple of others), so there is an obvious 62% reduction in FP32 throughput right there.
(3) And last:
a) Branch statistics show 90% "branch efficiency" but only 50% "control flow efficiency", so I can see how this could hurt performance when combined with the issue efficiency from (1).
These experiments seem independent of each other, and each only partially explains the low FLOP performance.
So, my questions:
- Is this assumption correct: `warp schedulers have to schedule one FMA per cycle to achieve theoretical FLOPs` ?
- Is this assumption correct: `These experiments seem independent of each other [(1) and (2), (2) and (3)]` ?
- Are my calculations correct? If yes then why do they differ so much from achieved FLOPs?
- What other experiments can I do to understand where the other bottlenecks are?
- (BONUS) How can control flow efficiency be 50% when the source-level 'Divergent branch' experiment shows all branches at 100% efficiency except for the traversal code at 90% (traverse-node loop), 87% (intersect-triangle loop), and two traversal-termination branches at 97% and 94% efficiency?
- (BONUS) How do I find where in the code so many cmp/min/max instructions are located? I'm using video instructions (based on a paper I can't remember), like `vmin.s32.s32.s32.min`; are they counted as cmp/min/max?
Bonus candy: if I somehow hit theoretical FLOPs, I would get about 200 fps on the Berkeley conference scene at 720p (using Instant Radiosity with 80 samples; my goal is to implement Lightcuts with 64 samples, which should have comparable performance)!