Kernel with very low eligible warps despite fully coalesced memory access

Hi all,

I am using Nsight Compute to profile a compute-intensive kernel. The report says the following about the scheduler statistics:

“Issue slot utilisation: Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only issues an instruction every 28.8 cycles. This might leave hardware resources underutilized and may lead to less optimal performance. Out of the maximum of 8 warps per scheduler, this kernel allocates an average of 3.52 active warps per scheduler, but only an average of 0.04 warps were eligible per cycle.”

So I looked at the warp statistics which say:

“long_scoreboard: On average, each warp of this kernel spends 79.3 cycles being stalled waiting for a scoreboard dependency on a L1TEX (local, global, surface, texture) operation. This represents about 78.3% of the total average of 101.4 cycles between issuing two instructions.”

This indicates a latency issue. In my kernel, such a latency issue could only arise if the warps are waiting for reads from global memory (I’m not using any local*, surface or texture memory in my code), but my kernel should have coalesced memory access because the threads read from memory with the following access pattern:

int idx = blockIdx.x * blockDim.x + threadIdx.x;

double a = my_struct.a[idx];
double b = my_struct.b[idx];
double c = my_struct.c[idx];
double d = my_struct.d[idx];

// a, b, c, d used for many complex maths operations (a Riemann solver for computational fluid dynamics)

Where my_struct is a struct-of-arrays containing my data (fluid state and others).
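To sanity-check that this pattern really is coalesced, here is a small back-of-envelope sketch (plain Python, not profiler output; the 32-byte sector size and the strided comparison pattern are assumptions for illustration) counting how many memory sectors one warp's load of 32 doubles touches:

```python
# Count the distinct 32-byte sectors covered by one warp's load of
# 32 doubles, comparing the contiguous (coalesced) pattern above
# with a hypothetical strided (AoS-style) pattern.

def sectors_touched(addresses, sector_size=32):
    """Number of distinct sectors covered by a set of byte addresses."""
    return len({addr // sector_size for addr in addresses})

WARP_SIZE = 32
DOUBLE_BYTES = 8

# Coalesced: thread t reads a[warp_base + t], i.e. adjacent 8-byte elements
coalesced = [t * DOUBLE_BYTES for t in range(WARP_SIZE)]

# Strided: hypothetical layout with a stride of 4 doubles between threads
strided = [t * 4 * DOUBLE_BYTES for t in range(WARP_SIZE)]

print(sectors_touched(coalesced))  # 8 sectors = 256 bytes, the minimum
print(sectors_touched(strided))    # 32 sectors: 4x the memory traffic
```

Adjacent threads read adjacent 8-byte elements, so each warp request maps to the minimum possible number of sectors.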

My question is, why does the benchmarking report suggest that there is a latency issue if my kernel should have coalesced memory access, which I assume would obviate latency issues? Could it be that because my kernel is so compute-intensive, most of the warps are busy with compute and the scheduler cannot issue additional instructions?

I’ve also inspected the source view and navigated to the line with the highest number of stall_long_sb hits (following the advice in this post), but that line is trivial, boiling down to something like return (value < tolerance_val) ? 0 : a / b;. In this line, I assume the issue is really the latency of loading a and b**, but I can’t see why there would be a latency issue in the first place. Also, this line shows 272k instructions executed and 1026 stall_long_sb. Does this indicate that there are actually very few stalls relative to the total instructions executed?

*I claimed I’m not using local memory, but could register spilling lead to the use of local memory, which I believe boils down to accessing global memory (as mentioned here) which would in turn explain the latency?

**Edit: I further looked at the SASS code and register dependencies after reading this post, confirming that the stall is due to reading data from my_struct. But again, the read pattern (as shown in the code block above) should lead to a coalesced memory transaction, which I imagine would obviate any latency issues. I’m quite stuck at this point, because my kernel boils down to reading from a 1D array in a fully coalesced manner (with adjacent threads reading contiguous memory locations) and then performing many maths operations. How could latency even be an issue here?

This is my first time properly using Nsight Compute and I’ve been enjoying learning more deeply about the architecture. Any thoughts about my problem would be appreciated.

PS: Here is some other potentially useful information from the report:

Compute throughput: 72%
Memory throughput: 3.6%
Elapsed cycles: 465k
Active cycles: 387k
Theoretical occupancy: 50%
Registers per thread: 96

“High compute throughput: Compute is more heavily utilized than Memory: Look at the Compute Workload Analysis report section to see what the compute pipelines are spending their time doing. Also, consider whether any computation is redundant and could be reduced or moved to look-up tables.”

“FP64/32 utilisation: The ratio of peak float (fp32) to double (fp64) performance on this device is 32:1. The kernel achieved close to 0% of this device’s fp32 peak performance and 49% of its fp64 peak performance. If Compute Workload Analysis determines that this kernel is fp64 bound, consider using 32-bit precision floating point operations to improve its performance.”

“Very high utilisation: FP64 is the highest-utilized pipeline (86.8%). It executes 64-bit floating point operations. The pipeline is over-utilized and likely a performance bottleneck.”

“Occupancy limiters: This kernel’s theoretical occupancy (50.0%) is limited by the number of required registers.”

I do not see any indication that the kernel is slowed down by memory access.

The output from the profiler suggests that the kernel spends most of its time (roughly 90% of instructions) on double-precision (FP64) computation, on a consumer GPU with low double-precision throughput, and this leads to low issue-slot utilization: in the time it takes to issue one FP64 instruction, the hardware could issue 32 FP32 instructions.

You might want to switch to a professional GPU with high FP64 throughput, or investigate if your code can make do with float computation.

Thank you very much for your reply.

Using FP32 might make my kernel run faster, but I still don’t see why there are so few eligible warps: my kernel reaches only 49% of peak FP64 throughput, and if the FP64 pipeline were oversubscribed, wouldn’t the warps report stalls due to insufficient compute resources rather than waiting on L1TEX fetches?

As per the profiler, the biggest issue seems to be the lack of eligible warps: there are 3.5 active warps per scheduler, but only 0.04 eligible warps. This seems to be because each warp spends on average nearly 80 cycles stalled on an L1TEX operation. If compute were the issue, why are the warps reporting stalls due to L1TEX operations rather than insufficient compute resources?

Thank you again.

There is nothing about coalesced loads that obviates latency issues. Coalescing results in more efficient use of the memory pipe; however, the latency to global memory is still present and has a lower bound of typically ~100 cycles or more. A transaction issued to global memory has the same latency whether it originated from a coalesced request or an uncoalesced one.

I note that your kernel has above-minimum register usage (96 registers per thread). That might also be correlated with register spilling. To rule local memory in or out as a concern, I would compile with -Xptxas=-v and inspect the reported amount of spill activity.

If all instructions were FP64 operations, I would expect the profiler to report one instruction issued per every 32 cycles on this particular GPU (“ratio of peak float (fp32) to double (fp64) performance on this device is 32”). Since the actual number is 28.8 cycles per instruction issued, I concluded that 90% of the dynamic instruction count consists of FP64 operations. I am not a profiler expert, but I believe that the low number of warps eligible per cycle is a direct consequence of being limited by low FP64 throughput.
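The arithmetic behind that estimate, as a quick sketch: if a fraction f of the issued instructions are FP64, each effectively occupying 32 issue cycles on this device (per the 32:1 ratio quoted above), and everything else occupies 1 cycle, then solving for f from the observed 28.8-cycle issue interval gives roughly 90%:

```python
# Estimate the FP64 fraction f of the dynamic instruction mix from the
# observed issue interval, assuming each FP64 instruction effectively
# costs 32 issue cycles and every other instruction costs 1:
#   f * 32 + (1 - f) * 1 = 28.8
cycles_per_instruction = 28.8
fp64_cost, other_cost = 32.0, 1.0

f = (cycles_per_instruction - other_cost) / (fp64_cost - other_cost)
print(f"FP64 fraction of instruction mix: {f:.1%}")  # 89.7%
```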

Nothing in the code snippet provided suggests a performance issue due to memory access. The data uses an SoA arrangement, with each array accessed using the TID pattern, i.e. access is contiguous across the threads in a warp. This should result in the best possible throughput. But at the start of the kernel there will be unavoidable latency as the data for a, b, c, d is pulled in from global memory before computation can start (I am assuming my_struct is a kernel argument, passed via constant memory). The kernel is described as compute-intensive, but you might want to examine this in slightly more detail using the roofline model: the memory traffic and FP64 operations required by this kernel, set in relation to the throughput (memory, FP64) provided by the GPU being used.
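As a rough illustration of that roofline comparison (the device figures below are assumptions from published RTX 2070-class specs, not profiler output): with a 1:32 FP64 ratio, the FP64 ridge point sits well below 1 FLOP/byte, so even a modest amount of FP64 work per byte loaded makes a kernel FP64-bound rather than memory-bound:

```python
# Rough FP64 roofline ridge point for an RTX 2070-class GPU.
# The peak figures are assumptions taken from published specs.
fp32_peak_gflops = 7500.0                 # ~7.5 TFLOP/s FP32
fp64_peak_gflops = fp32_peak_gflops / 32  # 1:32 ratio -> ~234 GFLOP/s
mem_bw_gbs = 448.0                        # GDDR6 memory bandwidth

ridge_flop_per_byte = fp64_peak_gflops / mem_bw_gbs
print(f"FP64 ridge point: {ridge_flop_per_byte:.2f} FLOP/byte")  # ~0.52

# A thread that loads four doubles (32 bytes) needs only ~17 FP64 ops
# on that data before the kernel is FP64-bound instead of memory-bound.
flops_at_ridge = ridge_flop_per_byte * 32
print(f"FP64 ops per 32 bytes at the ridge: {flops_at_ridge:.0f}")
```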

It is entirely possible that there are other issues, including those related to memory access, in parts of the code not shown. Generally speaking, it is hard to make an exact diagnosis when significant amounts of information are being withheld by the asker and no repro code is provided.

This may also be of interest, although I don’t think it directly answers any of your questions.

Hi both, apologies for the late reply, I’ve spent some of this time building a better mental model about GPU programming and hardware to better understand the performance guidelines in the CUDA programming guide.

Switching to FP32 significantly improved the number of eligible warps in the kernel, so I have marked @njuffa’s last reply as the solution because it has cleared up my confusion for now; more details below if you wish to read.

Essentially, I did not properly understand what is meant by the phrase “the scheduler can issue one instruction per clock cycle for one warp”. I thought this meant the scheduler can always issue one instruction per clock cycle for one warp — in particular, that it could issue an FP64 instruction for a warp every cycle — not realising that this is only possible if the scheduler is not being held back by the GPU’s peak FP64 instruction throughput…

I had initially assumed you concluded that 90% of the instruction mix is FP64 operations not from the back-of-the-envelope calculation you describe above, but simply from this profiler message (included in my OP):

Very high utilisation: FP64 is the highest-utilized pipeline (86.8%). It executes 64-bit floating point operations. The pipeline is over-utilized and likely a performance bottleneck.”

Nonetheless, I was curious to explore your belief that the number of eligible warps is being limited by low FP64 throughput. First, as you suggested:

I analysed the kernel using the roofline model. In the resulting chart, the red dot marking the achieved FP64 throughput sits very close to the lower roofline, which is the peak FP64 throughput, so it does look like my kernel is limited by low FP64 throughput.

Further corroborating this is Table 3 in the CUDA programming guide: my GPU is the RTX 2070 with compute capability 7.5, which as per Table 3 has an FP64 throughput of 2 (!) operations per clock cycle per SM; dividing this number by 32 to go from operation throughput to warp-instruction throughput gives an FP64 instruction throughput of 1/16 per clock cycle per SM 😂

On the other hand, the FP32 instruction throughput is 2 per clock cycle per SM. This instruction throughput should significantly increase the number of eligible warps in my kernel if it was indeed limited by FP64 throughput.
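The same comparison in numbers (a quick sketch using the Table 3 per-SM figures for compute capability 7.5 quoted above):

```python
WARP_SIZE = 32

# Per-SM operation throughputs for compute capability 7.5, from Table 3
# of the CUDA programming guide.
fp64_ops_per_cycle = 2
fp32_ops_per_cycle = 64

# One warp instruction covers 32 lanes, so divide by the warp size to
# convert operation throughput into warp-instruction throughput.
fp64_instr_per_cycle = fp64_ops_per_cycle / WARP_SIZE   # 1/16
fp32_instr_per_cycle = fp32_ops_per_cycle / WARP_SIZE   # 2

print(fp64_instr_per_cycle)  # 0.0625
print(fp32_instr_per_cycle)  # 2.0
print(fp32_instr_per_cycle / fp64_instr_per_cycle)  # 32.0 (the 32:1 ratio)
```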

With this in mind, I compiled my code with 32-bit precision, re-profiled the kernel and got the following scheduler stats (shown in parentheses are the corresponding stats from the 64-bit version):

GPU max warps per scheduler: 8 (8)
Theoretical warps per scheduler: 6 (4)
Active warps per scheduler: 5.50 (3.84)
Eligible warps per scheduler: 1.29 (0.05)
Issued warps per scheduler: 0.62 (0.04)*****

The stats show that the number of eligible warps is a lot higher using 32-bit precision, presumably because the peak FP32 throughput of this GPU is much higher, supporting the belief that my kernel was previously limited by low FP64 throughput.
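Converting those issue rates into average cycles between issued instructions makes the improvement concrete:

```python
# Average cycles between issued instructions per scheduler, from the
# profiled issue rates above (FP32 vs FP64 build).
issued_per_cycle_fp32 = 0.62
issued_per_cycle_fp64 = 0.04

interval_fp32 = 1 / issued_per_cycle_fp32  # ~1.6 cycles between issues
interval_fp64 = 1 / issued_per_cycle_fp64  # 25 cycles between issues

print(f"{interval_fp32:.1f} vs {interval_fp64:.1f} cycles between issues; "
      f"~{interval_fp64 / interval_fp32:.1f}x improvement")  # ~15.5x
```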

I did a similar exercise of comparing 32- vs 64-bit precision by profiling my kernel on my laptop with a GTX 1050, which, like the RTX 2070, has a similarly stark difference between peak FP32 and FP64 throughput. This exercise (for which Nsight Compute is not supported, so I had to use the Visual Profiler) seemed to give broadly similar results; unfortunately I don’t have my laptop with me right now to include them here.

*****I am guessing that I should improve instruction throughput in this kernel and get the number of issued warps per scheduler up to 1. However, I won’t need to do this right now, as this profiling exercise has been enough for me to learn significantly more about CUDA programming.

Yes, I understand; I think my use of “obviate latency issues” was the wrong phrase. What I meant to say was that I wasn’t at all expecting the profiler to report any issues about latency since I have coalesced loads, but from what you said above, and from @njuffa pointing out that there is unavoidable latency at the start of the kernel when the data for a, b, c, d is pulled in, I suppose it’s not too strange to see.

Thanks for this suggestion.

I’d already read this when I was troubleshooting last time, but thanks anyway for linking to it; glad to see I was at least vaguely reading about the correct things…

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.