Long/Short Scoreboard Stall

Hi everyone,
I’m using CUDA to implement a solver for the thermodynamics part of a CFD program. When I profile the program, I get high stall values due to long/short scoreboard, and I would like to reduce them and optimize the program. Can anyone give me some advice?

The code is the following:

And the yellow line is the largest source of long scoreboard stalls:

And in reference to line 95 (yellow):

It’s not easy to immediately understand and provide guidance on how to refactor code just based on these screenshots.

One important thing to note is that long scoreboard stalls don’t necessarily impact performance if other warps can run while this one waits for data. I would recommend looking at Warp State Statistics on the details page to see whether you have low issue slot utilization. If not, these long scoreboard stalls may not be hurting performance. Also look at the other rules and warnings that Nsight Compute reports, to make sure this is the bottleneck to focus on.

However, having said that, one common issue that causes this behavior is a memory access pattern that doesn’t leave enough time for loads to complete before their results are used. Looking at the assembly, we see R2 being written on line 14 and then used on line 18. This could be the fetch causing the long scoreboard stalls. It looks like the read of d_DENSITY[id] on line 95. You can try to issue this read earlier and place other independent instructions between it and the calculations that need it. Sometimes loop unrolling can help with this as well.
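
To make the "read earlier" and unrolling suggestions concrete, here is a minimal sketch. It is not based on your actual kernel, which I can only see in screenshots: the eos() function, the d_ENERGY and d_PRESSURE arrays, the kernel names, and the launch configuration are all hypothetical stand-ins, and only d_DENSITY[id] comes from your code. The first kernel consumes each load immediately. The second issues the loads for several independent elements first and only then does the arithmetic, so more memory requests are in flight per warp before the first dependent instruction.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the thermodynamic update (not the original solver's math).
__device__ __forceinline__ float eos(float rho, float e) {
    return 0.4f * rho * e;
}

// Baseline: every load is used right away, so the warp waits on each one.
__global__ void pressure_naive(const float* __restrict__ d_DENSITY,
                               const float* __restrict__ d_ENERGY,
                               float* __restrict__ d_PRESSURE, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int id = blockIdx.x * blockDim.x + threadIdx.x; id < n; id += stride) {
        d_PRESSURE[id] = eos(d_DENSITY[id], d_ENERGY[id]);
    }
}

// Restructured: issue UNROLL independent loads first, then compute on all of them.
template <int UNROLL>
__global__ void pressure_batched(const float* __restrict__ d_DENSITY,
                                 const float* __restrict__ d_ENERGY,
                                 float* __restrict__ d_PRESSURE, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int id = blockIdx.x * blockDim.x + threadIdx.x; id < n; id += UNROLL * stride) {
        float rho[UNROLL], e[UNROLL];
        #pragma unroll
        for (int k = 0; k < UNROLL; ++k) {          // load phase: no dependent math yet
            int j = id + k * stride;
            rho[k] = (j < n) ? d_DENSITY[j] : 0.0f;
            e[k]   = (j < n) ? d_ENERGY[j]  : 0.0f;
        }
        #pragma unroll
        for (int k = 0; k < UNROLL; ++k) {          // compute phase: loads already in flight
            int j = id + k * stride;
            if (j < n) d_PRESSURE[j] = eos(rho[k], e[k]);
        }
    }
}

// Host helper: runs both kernels and returns the number of differing outputs,
// or -1 if any CUDA call failed.
int count_mismatches(int n) {
    std::vector<float> h_rho(n), h_e(n), h_a(n), h_b(n);
    for (int i = 0; i < n; ++i) { h_rho[i] = 1.0f + 0.001f * i; h_e[i] = 2.0f + 0.002f * i; }

    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *d_rho = nullptr, *d_e = nullptr, *d_a = nullptr, *d_b = nullptr;
    bool ok = cudaMalloc(&d_rho, bytes) == cudaSuccess && cudaMalloc(&d_e, bytes) == cudaSuccess &&
              cudaMalloc(&d_a, bytes) == cudaSuccess && cudaMalloc(&d_b, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(d_rho, h_rho.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_e, h_e.data(), bytes, cudaMemcpyHostToDevice);
        pressure_naive<<<64, 256>>>(d_rho, d_e, d_a, n);
        pressure_batched<4><<<64, 256>>>(d_rho, d_e, d_b, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost) == cudaSuccess &&
             cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_rho); cudaFree(d_e); cudaFree(d_a); cudaFree(d_b);
    if (!ok) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i) bad += (h_a[i] != h_b[i]);
    return bad;
}
```

Calling count_mismatches(1 << 20) from a main() is a quick way to confirm that both variants produce the same output before comparing them in Nsight Compute. Keep in mind that the compiler may already hoist and batch loads on its own, and that holding more values in registers can reduce occupancy, so whether this helps is something to confirm in the profiler rather than assume.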