Hi everyone,
I have recently implemented a kernel for a CFD program:
__global__ void SetTDState_rhoe_Kernel(su2double *d_DENSITY, su2double *d_ENERGY, su2double *d_PRESSURE, su2double *d_TEMPERATURE,
        su2double *d_SOUNDSPEED2, su2double *d_DPDRHOE, su2double *d_DPDERHO, su2double *d_DTDERHO, su2double *d_DTDRHOE, su2double *d_ENTROPY, su2double *d_ZED,
        su2double Gamma_Minus_One, su2double Gas_Constant, su2double a, su2double b,
        unsigned long nPoint) {
    // Global thread index; unsigned long to match the type of nPoint
    unsigned long id = (unsigned long)blockIdx.x * blockDim.x + threadIdx.x;
    // BASE VERSION
    if (id < nPoint) {
        // Inputs loaded from global memory
        su2double Density = d_DENSITY[id];
        su2double Energy = d_ENERGY[id];
        d_PRESSURE[id] = Gamma_Minus_One * Density / (1.0 - Density * b) * (Energy + Density * a) - a * Density * Density;
        d_TEMPERATURE[id] = (d_PRESSURE[id] + Density * Density * a) * ((1.0 - Density * b) / (Density * Gas_Constant));
        d_ENTROPY[id] = Gas_Constant * (log(d_TEMPERATURE[id]) / Gamma_Minus_One + log(1.0 / Density - b));
        d_DPDERHO[id] = Density * Gamma_Minus_One / (1.0 - Density * b);
        d_DPDRHOE[id] = Gamma_Minus_One / (1.0 - Density * b) *
                        ((Energy + 2.0 * Density * a) + Density * b * (Energy + Density * a) / (1.0 - Density * b)) -
                        2.0 * Density * a;
        d_DTDRHOE[id] = Gamma_Minus_One / Gas_Constant * a;
        d_DTDERHO[id] = Gamma_Minus_One / Gas_Constant;
        d_SOUNDSPEED2[id] = d_DPDRHOE[id] + d_PRESSURE[id] / (Density * Density) * d_DPDERHO[id];
        d_ZED[id] = d_PRESSURE[id] / (Gas_Constant * d_TEMPERATURE[id] * Density);
    }
}
where nPoint is the number of points in the computational grid for every single iteration and the d_SOMETHING arrays are where I save the calculated thermodynamic variables. The program works and now I'm trying to optimize it.
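To make the formulas easier to check outside the GPU, here is a minimal host-side sketch of the same calculation for a single grid point. It is not part of the CFD program: the names TDState and SetTDState_rhoe_Host are hypothetical, and it assumes su2double is a plain double.

```cpp
#include <cmath>

// Assumption: su2double is a plain double here (no special numeric type).
using su2double = double;

// Hypothetical container for the outputs of one grid point.
struct TDState {
  su2double Pressure, Temperature, Entropy;
  su2double dPde_rho, dPdrho_e, dTdrho_e, dTde_rho;
  su2double SoundSpeed2, Zed;
};

// Same expressions as in the kernel, evaluated for one point on the host.
inline TDState SetTDState_rhoe_Host(su2double Density, su2double Energy,
                                    su2double Gamma_Minus_One, su2double Gas_Constant,
                                    su2double a, su2double b) {
  TDState s;
  const su2double covol = 1.0 - Density * b;  // term shared by several expressions

  s.Pressure = Gamma_Minus_One * Density / covol * (Energy + Density * a) - a * Density * Density;
  s.Temperature = (s.Pressure + Density * Density * a) * (covol / (Density * Gas_Constant));
  s.Entropy = Gas_Constant * (std::log(s.Temperature) / Gamma_Minus_One + std::log(1.0 / Density - b));
  s.dPde_rho = Density * Gamma_Minus_One / covol;
  s.dPdrho_e = Gamma_Minus_One / covol *
                   ((Energy + 2.0 * Density * a) + Density * b * (Energy + Density * a) / covol) -
               2.0 * Density * a;
  s.dTdrho_e = Gamma_Minus_One / Gas_Constant * a;
  s.dTde_rho = Gamma_Minus_One / Gas_Constant;
  s.SoundSpeed2 = s.dPdrho_e + s.Pressure / (Density * Density) * s.dPde_rho;
  s.Zed = s.Pressure / (Gas_Constant * s.Temperature * Density);
  return s;
}
```

With a = 0 and b = 0 the expressions reduce to the ideal-gas case, so for example Zed should come out as 1, which gives a quick sanity check when comparing against the arrays copied back from the device.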
For the arrays on the host side, I use Unified Virtual Addressing (UVA) because, after several tests, I saw that for the graphics card in my laptop (RTX 3060) this is the best configuration.
But when I profile the program I get a "long_scoreboard" stall on the line:
d_PRESSURE[id] = Gamma_Minus_One * Density / (1.0 - Density * b) * (Energy + Density * a) - a * Density * Density;
which comes right after Density and Energy are loaded from global memory.
After that I get a lot of short_scoreboard stalls. My question is: what types of stall are these (I couldn't find a useful explanation on the internet), and how can I try to avoid these scoreboard stalls?
I am also attaching the result of the kernel profiling.
UVA_BASE_1_256.tar.xz (1.1 MB)