# NSight : How to calculate FLOP/s that's close to achieved FLOP/s

I have a ray-tracing CUDA kernel and I’m trying to understand its performance through NSight Visual Studio profiling.

My goal is to understand why I’m only reaching about 320 GFLOPs whereas my GPU can potentially reach 10.1 TFLOPs (GTX 1080 at 1987 MHz).

(1) Based on what I understand from NVidia documentation:
a) My issue efficiency is low (40%), which means warp schedulers have no instruction to execute for 60% of the cycles, because all warps are stalled. Stalling is mainly because of memory (my kernel seems to be latency bound, which is expected in ray-tracing).
b) It seems to me that warp schedulers have to schedule one FMA per cycle (counted as 2 FLOPs) to reach theoretical FLOPs, since cores have a throughput of 1 FMA/cycle. So, I can already see how a low issue efficiency is a problem to reach theoretical FLOPs.

(2) However:
a) Pipe utilization shows 43-47% arithmetic utilization. Shouldn’t this be lower than issue efficiency? Or does it have to be multiplied by issue efficiency to actually get an idea of throughput?
b) Arithmetic workload shows only 38% FP32 operations (the rest is CMP/min/max mainly and a couple others). So obviously there is a 62% performance reduction here.

(3) And last:
a) Branch statistics show 90% “branch efficiency” but only 50% “control flow efficiency”. So I see how this could lead to bad performance when combined with Issue Efficiency (1).

These experiments seem independent of each other and only partially explain the low FLOP performance.

• Arithmetic pipeline is used only 43% of the time, and is doing only 38% FP32. So, theoretically, FLOPs should be `10100GFLOPs * .43 * .38 = 1650GFLOPs`, about 5x current performance.
• Instructions are issued only 40% of the time, to 50% of the cores (see (3)). I'm assuming most instructions are arithmetic (of which about 38% are FP32 FMA), so FLOPs should be `10100GFLOPs * .40 * .50 * .38 = 767GFLOPs` (still 2.5x current performance).

So, my questions:

1. Is this assumption correct: `warp schedulers have to schedule one FMA per cycle to achieve theoretical FLOPs` ?
2. Is this assumption correct: `These experiments seem independent of each other [(1) and (2), (2) and (3)]` ?
3. Are my calculations correct? If yes then why do they differ so much from achieved FLOPs?
4. What other experiments can I do to understand where the other bottlenecks are?
5. (BONUS) How can control flow be 50% when source-level 'Divergent branch' experiments shows all branches at 100% efficiency except from traverse code at 90% (traverse node loop), 87% (intersect triangle loop), and 2 traverse termination branches at 97% and 94% efficiency?
6. (BONUS) How do I know where in the code are located so many Cmp/min/max instructions? I'm using video instructions (based on a paper that I can't remember), like `vmin.s32.s32.s32.min`, are they part of Cmp/min/max?

Bonus candy: if somehow I hit theoretical FLOPs, I would hit about 200fps on the Berkeley conference scene at 720p (using Instant Radiosity with 80 samples, my goal is to implement LightCuts with 64 samples which should be comparable performance) !

Nsight calculates FLOPS in the Achieved FLOPS experiment. In the Activity Editor, if you set Experiment to Run to Custom, you can add the Achieved FLOPS experiment. If you click on the (?) icon next to the experiment, the Activity Editor will display the weighting applied per instruction. For FP32, FMA and RSQ count as 2 operations; all others count as 1.

Achieved FLOPS is calculated by collecting the “Thread Instructions Executed Not Predicated Off” counter. This can be seen in the CUDA Source View per SASS instruction. For all single precision floating point instructions this value is multiplied by the weight of the instruction (1 or 2).

The sum of the single precision operations is divided by the time duration of the kernel.
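
As a rough illustration of that calculation (a minimal sketch, not Nsight's actual implementation), the Python below applies the weights to per-instruction thread counts and divides by the kernel duration. The instruction labels, the counts, and the duration are made-up values for illustration only.

```python
# Weights as described above: FMA and RSQ count as 2 operations, other FP32
# instructions as 1. The keys are illustrative labels, not an exhaustive list
# of SASS opcodes.
FP32_WEIGHTS = {"FMA": 2, "RSQ": 2, "FADD": 1, "FMUL": 1}


def achieved_flops(thread_inst_counts, duration_s, weights=FP32_WEIGHTS):
    """Weighted sum of not-predicated-off FP32 thread instructions per second."""
    total_ops = sum(
        weights[op] * count
        for op, count in thread_inst_counts.items()
        if op in weights  # non-FP32 instructions (e.g. compares) contribute nothing
    )
    return total_ops / duration_s


# Hypothetical "Thread Instructions Executed Not Predicated Off" totals.
example_counts = {"FMA": 1_000_000, "FADD": 500_000, "RSQ": 100_000, "ISETP": 700_000}
print(achieved_flops(example_counts, duration_s=0.5))
```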

1. Is this assumption correct: `warp schedulers have to schedule one FMA per cycle to achieve theoretical FLOPs` ?
Yes and No. In order to hit the marketing definition of maximum FLOPS every warp scheduler (Maxwell - Pascal) would have to issue 1 FMA (all 32 threads active) per cycle. Nsight has separate values for ADD, MUL, FMA, and Special. The marketing definition would not include special. In order to hit the maximum including special the warp scheduler will have to dual issue at the repeat rate of the special function unit.

2. Is this assumption correct: `These experiments seem independent of each other [(1) and (2), (2) and (3)]` ?
If you are asking whether issue efficiency, pipe utilization, and branch statistics are independent, I would say mostly. Issue efficiency impacts pipe utilization.

3. Are my calculations correct? If yes then why do they differ so much from achieved FLOPs?
Your assumption in Q1 is correct (FMA per scheduler per cycle).

Issue Efficiency is a measure of whether each warp scheduler issued an instruction every cycle. Even if the warp scheduler issued an FMA every cycle, you could get
Issue Efficiency = 100%
FLOPS = 1/32 theoretical

Issue Efficiency measures the rate at which warp instructions are issued but does not consider active, not predicated off threads. If only 1 thread is predicated true per FMA, the kernel can only achieve 1/32 of theoretical FLOPS.
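
The relationship can be sketched with a simple multiplicative model. This is only a back-of-the-envelope approximation: it treats the factors as independent, and the SM count (20) and schedulers per SM (4) used for the GTX 1080 are assumptions not stated in this thread. The function names are hypothetical.

```python
WARP_SIZE = 32


def peak_fp32_flops(num_sms, schedulers_per_sm, clock_hz):
    # One FMA warp instruction per scheduler per cycle, 32 threads, 2 FLOPs per FMA.
    return num_sms * schedulers_per_sm * WARP_SIZE * 2 * clock_hz


def estimated_flops(peak, issue_efficiency, fma_fraction, active_threads):
    # Scale peak by how often an instruction issues, how many of those are FMAs,
    # and how many of the 32 lanes are active and predicated true.
    return peak * issue_efficiency * fma_fraction * (active_threads / WARP_SIZE)


# Assumed GTX 1080 configuration: 20 SMs, 4 warp schedulers each, 1987 MHz.
peak = peak_fp32_flops(num_sms=20, schedulers_per_sm=4, clock_hz=1.987e9)

# 100% issue efficiency, all FMAs, but a single active thread per warp.
one_thread = estimated_flops(peak, issue_efficiency=1.0, fma_fraction=1.0, active_threads=1)
print(one_thread / peak)  # fraction of theoretical FLOPS
```

With the figures from the question (40% issue efficiency, 38% FP32, about half the lanes active), the same model gives `estimated_flops(10100e9, 0.40, 0.38, 16)`, which reproduces the 767 GFLOPs estimate.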

4. What other experiments can I do to understand where the other bottlenecks are?

To reach theoretical FLOPS specifically, you have to achieve the following:
a. every cycle every warp scheduler issues an FMA
b. all 32 threads are active and predicated true
c. all warp schedulers execute for the same number of cycles

The Achieved FLOPS and Instruction Count experiments help to understand the number of operations. In Instruction Count you will have to go into the Source View and look for FMA instructions in the SASS view.

Pipe Utilization can help to see if the math pipes are being issued every cycle. If not you can look at Issue Efficiency/Issue Stall Reasons to see what is stalling the warps. You can also inspect the SASS. If you are issuing FADD/FMUL (and not FMA) the kernel will be limited to 50% theoretical FLOPS.
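
To see why the instruction mix matters, here is a small sketch under the weights described earlier (FMA = 2 operations, FADD/FMUL = 1). The function name and the slot counts are hypothetical; an "issue slot" here means one cycle on one warp scheduler.

```python
def fraction_of_peak(fma_slots, add_mul_slots, total_slots):
    """Fraction of theoretical FLOPS for a given mix of issue slots.

    Peak assumes every slot issues an FMA (2 operations). A slot that issues
    FADD or FMUL yields 1 operation; any other slot yields none.
    """
    return (2 * fma_slots + add_mul_slots) / (2 * total_slots)


print(fraction_of_peak(fma_slots=100, add_mul_slots=0, total_slots=100))  # all FMA
print(fraction_of_peak(fma_slots=0, add_mul_slots=100, total_slots=100))  # all FADD/FMUL
```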

In order to determine if all threads are active you can look at the Source View. Add the counter “Thread Instructions Executed Not Predicated Off” by right clicking on the column header and executing Column Chooser. Compare this to Thread Instructions Executed.

In order to determine if there is a tail effect resulting in not all SMs (warp schedulers) being active for the duration of the kernel you can look at the Instruction Statistics Experiment. Instructions Per Clock needs to be 4 (1 per scheduler) and TPC Activity should be close to 100% for all TPCs (==SMs for Maxwell and Pascal).

5. (BONUS) How can control flow be 50% when source-level ‘Divergent branch’ experiments shows all branches at 100% efficiency except from traverse code at 90% (traverse node loop), 87% (intersect triangle loop), and 2 traverse termination branches at 97% and 94% efficiency?

If you launch a 1 threaded kernel the Control Flow Efficiency would be 1/32. It only takes 1 branch to get to a very low efficiency. The source view can show this information if you compare the Thread Instructions Executed and the Thread Instructions Executed Not Predicated Off columns.
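
A minimal sketch of that comparison, assuming the efficiency is roughly the average fraction of the 32 lanes doing work per executed warp instruction. The exact formula Nsight uses is not given in this thread, so the function names and definitions below are illustrative only.

```python
WARP_SIZE = 32


def lane_efficiency(thread_insts_executed, warp_insts_executed):
    # Average fraction of the 32 lanes that were active per warp instruction.
    return thread_insts_executed / (warp_insts_executed * WARP_SIZE)


def predicated_on_fraction(thread_insts_not_predicated_off, thread_insts_executed):
    # Share of active-thread instructions that were not predicated off.
    return thread_insts_not_predicated_off / thread_insts_executed


# A 1-thread kernel: every warp instruction executes on exactly one lane.
print(lane_efficiency(thread_insts_executed=100, warp_insts_executed=100))
```

Applying the same ratios per SASS line in the Source View shows which instructions run with mostly idle lanes, even when each individual branch reports a high efficiency.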

6. (BONUS) How do I know where in the code are located so many Cmp/min/max instructions? I’m using video instructions (based on a paper that I can’t remember), like `vmin.s32.s32.s32.min`, are they part of Cmp/min/max?

Collect the Instruction Count experiment and open the Source View. If you build with -lineinfo you can correlate the SASS to high level and find any instruction you like. Once you understand the SASS instructions that are generated you can use the “Disassembly Regex Match” experiment to count the number of any type of instruction you want. Achieved FLOPS actually uses the Disassembly Regex Match experiment. At the bottom of the description of Achieved FLOPS there is an edit box under Experiment Definition which is the Disassembly Regex Match for Achieved FLOPS (updated based upon the weights for operations).

Thank you for this very detailed and thorough answer. It seems that I was missing an experiment (Instruction count - not predicated off); otherwise it answers most of my technical questions.

The conclusion for me is the following:

• register file is too small compared to FP32 units (can’t hide memory latencies well)
• divergent control flow is also a big issue.

Now I’m better understanding the limitations of hardware vs this application (ray-tracing).
