Short and Long scoreboard stall

Hi everyone,
I have recently implemented a kernel for a CFD program:

__global__ void SetTDState_rhoe_Kernel(su2double *d_DENSITY,  su2double *d_ENERGY, su2double *d_PRESSURE, su2double *d_TEMPERATURE,
						su2double *d_SOUNDSPEED2, su2double *d_DPDRHOE, su2double *d_DPDERHO, su2double *d_DTDERHO, su2double *d_DTDRHOE, su2double *d_ENTROPY, su2double *d_ZED,
						su2double Gamma_Minus_One, su2double Gas_Constant, su2double a, su2double b,
						unsigned long nPoint){

	int id= blockIdx.x * blockDim.x  + threadIdx.x;

// BASE VERSION

	if(id<nPoint){
	su2double Density = d_DENSITY[id];
	su2double Energy = d_ENERGY[id];

	d_PRESSURE[id] = Gamma_Minus_One * Density / (1.0 - Density * b) * (Energy + Density * a) - a * Density * Density;
	d_TEMPERATURE[id] = (d_PRESSURE[id] + Density * Density * a) * ((1.0 - Density * b) / (Density * Gas_Constant));
	d_ENTROPY[id]= Gas_Constant * (log(d_TEMPERATURE[id]) / Gamma_Minus_One + log(1.0 / Density - b));
	d_DPDERHO[id] = Density * Gamma_Minus_One / (1.0 - Density * b);
	d_DPDRHOE[id] = Gamma_Minus_One / (1.0 - Density * b) *
				 ((Energy + 2.0 * Density * a) + Density * b * (Energy + Density * a) / (1.0 - Density * b)) -
				 2.0 * Density * a;
	d_DTDRHOE[id] = Gamma_Minus_One / Gas_Constant * a;
	d_DTDERHO[id] = Gamma_Minus_One / Gas_Constant;
	d_SOUNDSPEED2[id]= d_DPDRHOE[id] + d_PRESSURE[id] / (Density * Density) * d_DPDERHO[id];
	d_ZED[id] = d_PRESSURE[id] / (Gas_Constant * d_TEMPERATURE[id] * Density);
	}
}

where nPoint is the number of points in the computational grid for every single iteration, and the d_SOMETHING[id] arrays are where I save the calculated thermodynamic variables. The program works and now I'm trying to optimize it.
For the arrays on the host side I use Unified Virtual Addressing (UVA), because after several tests I saw that this is the best configuration for the graphics card in my laptop (RTX 3060).
But when I profile the program I get a "long_scoreboard" stall on the line
d_PRESSURE[id] = Gamma_Minus_One * Density / (1.0 - Density * b) * (Energy + Density * a) - a * Density * Density;
which comes right after Density and Energy are loaded from global memory.
After that I get a lot of "short_scoreboard" stalls. My question is: what types of stall are these, and how can I try to avoid them? I couldn't find a useful explanation on the internet.
I am also attaching the result of the kernel profiling.
UVA_BASE_1_256.tar.xz (1.1 MB)

Profiler related questions usually get better / faster replies in the profiler forum:
https://forums.developer.nvidia.com/c/development-tools/nsight-compute/

Three relevant hits from a 30 second Google search:

One thing to consider, which may or may not be related to the stalls, is whether you actually require double precision.

Looking at the Instruction Throughput table in the CUDA C++ Programming Guide, for the Compute Capability of your card, 8.6, 64-bit floating-point add, multiply and multiply-add only have a throughput of 2 results per clock cycle per SM. The 32-bit float equivalents run at 128 results per clock cycle per SM.
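
If single precision turns out to be accurate enough for your solver, one way to try it without duplicating code is to template the arithmetic on the floating-point type. What follows is only a sketch under assumptions: PressureVdW, Pressure_Kernel and the HD macro are hypothetical names that are not in your code, and only the pressure line is covered. Note the Real(1) in place of 1.0: a double literal inside a float expression promotes the whole expression to double, which quietly brings the slow FP64 instructions back.

```cuda
#include <math.h>

// Lets the helper below also compile as plain C++ for host-side checks.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper (not from the original code): the pressure line of the
// kernel, written for an arbitrary floating-point type Real.
template <typename Real>
HD inline Real PressureVdW(Real Density, Real Energy, Real Gamma_Minus_One,
                           Real a, Real b) {
  const Real one = Real(1);  // avoids the double-precision literal 1.0
  return Gamma_Minus_One * Density / (one - Density * b) * (Energy + Density * a)
         - a * Density * Density;
}

#ifdef __CUDACC__
// Hypothetical kernel: same indexing as the original, one thread per grid point.
template <typename Real>
__global__ void Pressure_Kernel(const Real* d_DENSITY, const Real* d_ENERGY,
                                Real* d_PRESSURE, Real Gamma_Minus_One,
                                Real a, Real b, unsigned long nPoint) {
  const unsigned long id = (unsigned long)blockIdx.x * blockDim.x + threadIdx.x;
  if (id < nPoint) {
    d_PRESSURE[id] = PressureVdW<Real>(d_DENSITY[id], d_ENERGY[id],
                                       Gamma_Minus_One, a, b);
  }
}
#endif
```

Launching Pressure_Kernel<float> and Pressure_Kernel<double> on the same data and comparing both the results and the timings should show whether the accuracy loss is acceptable for your case.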

Keep in mind that the CUDA compiler is conservative when it comes to re-associating floating-point expressions, and will basically only apply FMA contraction. For single-precision computation, specifying -use_fast_math can often improve performance at the expense of accuracy.

For example, in this code there are, from a cursory glance, at least four instances of a double-precision division by (1.0 - Density * b) and it would help performance to compute the reciprocal double r = 1.0 / (1.0 - Density * b) once and then multiply by this. The compiler does not pull divisions apart like that automagically because (usually small) numerical differences would result. If that is OK, one can apply such transformations manually.
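
A minimal sketch of that manual transformation, assuming su2double is plain double and that the small numerical differences are acceptable. TDState, ComputeTDState, the HD macro and the kernel name SetTDState_rhoe_Recip_Kernel are hypothetical names introduced here for illustration. The sketch also keeps intermediate results in local variables instead of reading them back from the output arrays, which may or may not help with the stalls you see.

```cuda
#include <math.h>

// Lets the helper below also compile as plain C++ for host-side checks.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

typedef double su2double;  // assumption: su2double is plain double

// Hypothetical struct holding the per-point results that depend on the state.
struct TDState {
  su2double pressure, temperature, entropy, dPde_rho, dPdrho_e, soundSpeed2, zed;
};

HD inline TDState ComputeTDState(su2double Density, su2double Energy,
                                 su2double Gamma_Minus_One, su2double Gas_Constant,
                                 su2double a, su2double b) {
  const su2double oneMinusRhoB = 1.0 - Density * b;
  const su2double r = 1.0 / oneMinusRhoB;  // the only division by (1 - Density*b)
  const su2double rhoA = Density * a;
  TDState s;
  s.pressure = Gamma_Minus_One * Density * r * (Energy + rhoA) - rhoA * Density;
  s.temperature = (s.pressure + Density * rhoA) * (oneMinusRhoB / (Density * Gas_Constant));
  s.entropy = Gas_Constant * (log(s.temperature) / Gamma_Minus_One + log(1.0 / Density - b));
  s.dPde_rho = Density * Gamma_Minus_One * r;
  s.dPdrho_e = Gamma_Minus_One * r *
               ((Energy + 2.0 * rhoA) + Density * b * (Energy + rhoA) * r) - 2.0 * rhoA;
  s.soundSpeed2 = s.dPdrho_e + s.pressure / (Density * Density) * s.dPde_rho;
  s.zed = s.pressure / (Gas_Constant * s.temperature * Density);
  return s;
}

#ifdef __CUDACC__
__global__ void SetTDState_rhoe_Recip_Kernel(
    const su2double* d_DENSITY, const su2double* d_ENERGY, su2double* d_PRESSURE,
    su2double* d_TEMPERATURE, su2double* d_SOUNDSPEED2, su2double* d_DPDRHOE,
    su2double* d_DPDERHO, su2double* d_DTDERHO, su2double* d_DTDRHOE,
    su2double* d_ENTROPY, su2double* d_ZED, su2double Gamma_Minus_One,
    su2double Gas_Constant, su2double a, su2double b, unsigned long nPoint) {
  const unsigned long id = (unsigned long)blockIdx.x * blockDim.x + threadIdx.x;
  if (id < nPoint) {
    // Compute everything in registers, then store each result once.
    const TDState s = ComputeTDState(d_DENSITY[id], d_ENERGY[id],
                                     Gamma_Minus_One, Gas_Constant, a, b);
    d_PRESSURE[id] = s.pressure;
    d_TEMPERATURE[id] = s.temperature;
    d_ENTROPY[id] = s.entropy;
    d_DPDERHO[id] = s.dPde_rho;
    d_DPDRHOE[id] = s.dPdrho_e;
    d_DTDRHOE[id] = Gamma_Minus_One / Gas_Constant * a;
    d_DTDERHO[id] = Gamma_Minus_One / Gas_Constant;
    d_SOUNDSPEED2[id] = s.soundSpeed2;
    d_ZED[id] = s.zed;
  }
}
#endif
```

Before adopting something like this, it is worth comparing its output against the original kernel on real data, since multiplying by the reciprocal can differ from the division in the last bits.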