Hi all,
I am using Nsight Compute to benchmark a compute-intensive kernel. The benchmark report says the following about the scheduler statistics:
“Issue slot utilisation: Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only issues an instruction every 28.8 cycles. This might leave hardware resources underutilized and may lead to less optimal performance. Out of the maximum of 8 warps per scheduler, this kernel allocates an average of 3.52 active warps per scheduler, but only an average of 0.04 warps were eligible per cycle.”
So I looked at the warp statistics which say:
“long_scoreboard: On average, each warp of this kernel spends 79.3 cycles being stalled waiting for a scoreboard dependency on a L1TEX (local, global, surface, texture) operation. This represents about 78.3% of the total average of 101.4 cycles between issuing two instructions.”
This indicates a latency issue. In my kernel, such latency could only arise from warps waiting on reads from global memory (I’m not using any local*, surface, or texture memory in my code), but my kernel should have coalesced memory access, because the threads read from memory with the following access pattern:
int idx = blockIdx.x * blockDim.x + threadIdx.x;
double a = my_struct.a[idx];
double b = my_struct.b[idx];
double c = my_struct.c[idx];
double d = my_struct.d[idx];
// a, b, c, d used for many complex maths operations (a Riemann solver for computational fluid dynamics)
Where my_struct is a struct-of-arrays containing my data (fluid state and others).
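For concreteness, the layout looks roughly like this (field names and the kernel body are simplified placeholders, not my exact code):

```cuda
// Simplified sketch of my struct-of-arrays layout (names are placeholders).
struct FluidState {
    double* a;  // each member is a device array of length n
    double* b;
    double* c;
    double* d;
};

__global__ void riemann_kernel(FluidState my_struct, double* out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    // Adjacent threads load adjacent doubles, so each warp reads
    // 32 * 8 B = 256 B of contiguous memory per array (coalesced).
    double a = my_struct.a[idx];
    double b = my_struct.b[idx];
    double c = my_struct.c[idx];
    double d = my_struct.d[idx];

    // ... many FP64 operations on a, b, c, d (the Riemann solver) ...
    out[idx] = a + b + c + d;  // stand-in for the real computation
}
```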
My question is: why does the report suggest a latency issue if my kernel should have coalesced memory access, which I assumed would obviate latency issues? Could it be that, because my kernel is so compute-intensive, most of the warps are busy with compute and the scheduler cannot issue additional instructions?
I’ve also inspected the source code and navigated to the line with the highest hits of stall_long_sb (following the advice in this post), but that line is trivial, boiling down to something like return (value < tolerance_val) ? 0 : a / b;. In this line, I assume the issue is rather the latency of loading a and b**, but I can’t see why there would be a latency issue in the first place. Also, this line reports 272k instructions executed and 1026 stall_long_sb samples, a ratio of roughly 0.4%. Does this indicate that there are actually very few stalls compared to the total instructions executed?
*I claimed I’m not using local memory, but could register spilling lead to the use of local memory? I believe local memory ultimately resides in global memory (as mentioned here), which would in turn explain the latency.
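One way to check for spills, as I understand it, is to recompile with verbose PTXAS output (the file name and -arch value below are placeholders for my actual build):

```
# Ask ptxas to report per-kernel register and spill usage.
nvcc -Xptxas -v -arch=sm_80 -c my_kernel.cu
# ptxas then prints something like:
#   ptxas info : Used 96 registers, 0 bytes spill stores, 0 bytes spill loads
```

Non-zero "spill stores"/"spill loads" would confirm that registers are spilling to local memory.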
**Edit: I further looked at the SASS code and register dependencies after reading this post, confirming that the stall is due to reading data from my_struct. But again, the read pattern (as shown in the code block above) should lead to a coalesced memory transaction, which I imagine would obviate any latency issues. I’m quite stuck at this point, because my kernel boils down to reading from a 1D array in a fully coalesced manner (with adjacent threads reading contiguous memory locations) and then performing many maths operations. How could latency even be an issue here?
This is my first time properly using Nsight Compute and I’ve been enjoying learning more deeply about the architecture. Any thoughts about my problem would be appreciated.
PS: Here is some other potentially useful information from the report:
Compute throughput: 72%
Memory throughput: 3.6%
Elapsed cycles: 465k
Active cycles: 387k
Theoretical occupancy: 50%
Registers per thread: 96
“High compute throughput: Compute is more heavily utilized than Memory: Look at the Compute Workload Analysis report section to see what the compute pipelines are spending their time doing. Also, consider whether any computation is redundant and could be reduced or moved to look-up tables.”
“FP64/32 utilisation: The ratio of peak float (fp32) to double (fp64) performance on this device is 32:1. The kernel achieved close to 0% of this device’s fp32 peak performance and 49% of its fp64 peak performance. If Compute Workload Analysis determines that this kernel is fp64 bound, consider using 32-bit precision floating point operations to improve its performance.”
“Very high utilisation: FP64 is the highest-utilized pipeline (86.8%). It executes 64-bit floating point operations. The pipeline is over-utilized and likely a performance bottleneck.”
“Occupancy limiters: This kernel’s theoretical occupancy (50.0%) is limited by the number of required registers.”
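Regarding that last point, I understand one can cap registers per thread with __launch_bounds__ to raise occupancy, at the risk of causing the register spilling I mentioned in my footnote. A hypothetical illustration (the numbers and kernel are made up, not my actual code):

```cuda
// __launch_bounds__ asks the compiler to fit within enough registers for
// 256 threads per block with at least 2 resident blocks per SM. Too
// aggressive a bound can force register spills to local memory.
__global__ void __launch_bounds__(256, 2)
my_kernel(const double* in, double* out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = in[idx];
}
```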
