Hello everyone,
I recently installed the new 3.1 drivers and the CUDA 3.1 toolkit. My code (a CFD algorithm) compiled and gave correct answers with CUDA 3.0. With CUDA 3.1 it still compiles without any problems, but it produces wrong results. The SDK examples run on 3.1 without issues.
While trying to track down the problem, I noticed that manually unrolling a “for” loop inside a kernel makes the code produce correct results. Below are the two variants of the relevant portion of the kernel, one giving the correct answer and the other not. Unless I am missing something really obvious, the two variants should do exactly the same thing, yet they give different results. The rest of the code is EXACTLY the same.
The code that gives wrong results with CUDA 3.1 but works with CUDA 3.0:
for(int j = 0; j < 4; j++)
{
    if(j == 0)
    {
        fl_temp = norm_x*(rho_l*vx_l) + norm_y*(rho_l*vy_l);
        fr_temp = norm_x*(rho_r*vx_r) + norm_y*(rho_r*vy_r);
        ql_temp = rho_l;
        qr_temp = rho_r;
    } else if(j == 1) {
        fl_temp = norm_x*(p_l+(rho_l*vx_l*vx_l)) + norm_y*(rho_l*vy_l*vx_l);
        fr_temp = norm_x*(p_r+(rho_r*vx_r*vx_r)) + norm_y*(rho_r*vy_r*vx_r);
        ql_temp = rho_l*vx_l;
        qr_temp = rho_r*vx_r;
    } else if(j == 2) {
        fl_temp = norm_x*(rho_l*vx_l*vy_l) + norm_y*(p_l+(rho_l*vy_l*vy_l));
        fr_temp = norm_x*(rho_r*vx_r*vy_r) + norm_y*(p_r+(rho_r*vy_r*vy_r));
        ql_temp = rho_l*vy_l;
        qr_temp = rho_r*vy_r;
    } else {
        fl_temp = norm_x*(vx_l*(energy_l + p_l)) + norm_y*(vy_l*(energy_l + p_l));
        fr_temp = norm_x*(vx_r*(energy_r + p_r)) + norm_y*(vy_r*(energy_r + p_r));
        ql_temp = energy_l;
        qr_temp = energy_r;
    }
    // Computing common normal flux
    fg_norm = 0.5*((fl_temp+fr_temp)
                   -(vn_av_mag+c_av)*(qr_temp-ql_temp));
    // Store transformed normal flux for left flux point
    d_norm_tfg_con_ed_fgpts[index_l] =
        fg_norm*d_mag_norm_dot_jac_inv_ed_fgpts[index2_l];
    index_l += n_ed_fgpts_per_cell;
    // Store transformed normal flux for right flux point
    d_norm_tfg_con_ed_fgpts[index_r] =
        -fg_norm*d_mag_norm_dot_jac_inv_ed_fgpts[index2_r];
    index_r += n_ed_fgpts_per_cell;
}
The code that works correctly with both CUDA 3.1 and 3.0:
// ------------------------------
// j=0
// ------------------------------
fl_temp = norm_x*(rho_l*vx_l) + norm_y*(rho_l*vy_l);
fr_temp = norm_x*(rho_r*vx_r) + norm_y*(rho_r*vy_r);
ql_temp = rho_l;
qr_temp = rho_r;
// Computing common normal flux
fg_norm = 0.5*((fl_temp+fr_temp)
-(vn_av_mag+c_av)*(qr_temp-ql_temp));
// Store transformed normal flux for left flux point
d_norm_tfg_con_ed_fgpts[index_l] =
fg_norm*d_mag_norm_dot_jac_inv_ed_fgpts[index2_l]; index_l += n_ed_fgpts_per_cell;
// Store transformed normal flux for right flux point
d_norm_tfg_con_ed_fgpts[index_r] =
-fg_norm*d_mag_norm_dot_jac_inv_ed_fgpts[index2_r]; index_r += n_ed_fgpts_per_cell;
// ------------------------
// j=1
// ------------------------
fl_temp = norm_x*(p_l+(rho_l*vx_l*vx_l)) + norm_y*(rho_l*vy_l*vx_l);
fr_temp = norm_x*(p_r+(rho_r*vx_r*vx_r)) + norm_y*(rho_r*vy_r*vx_r);
ql_temp = rho_l*vx_l;
qr_temp = rho_r*vx_r;
// Computing common normal flux
fg_norm = 0.5*((fl_temp+fr_temp)
-(vn_av_mag+c_av)*(qr_temp-ql_temp));
// Store transformed normal flux for left flux point
d_norm_tfg_con_ed_fgpts[index_l] =
fg_norm*d_mag_norm_dot_jac_inv_ed_fgpts[index2_l]; index_l += n_ed_fgpts_per_cell;
// Store transformed normal flux for right flux point
d_norm_tfg_con_ed_fgpts[index_r] =
-fg_norm*d_mag_norm_dot_jac_inv_ed_fgpts[index2_r]; index_r += n_ed_fgpts_per_cell;
// ------------------------
// j=2
// ------------------------
fl_temp = norm_x*(rho_l*vx_l*vy_l) + norm_y*(p_l+(rho_l*vy_l*vy_l));
fr_temp = norm_x*(rho_r*vx_r*vy_r) + norm_y*(p_r+(rho_r*vy_r*vy_r));
ql_temp = rho_l*vy_l;
qr_temp = rho_r*vy_r;
// Computing common normal flux
fg_norm = 0.5*((fl_temp+fr_temp)
-(vn_av_mag+c_av)*(qr_temp-ql_temp));
// Store transformed normal flux for left flux point
d_norm_tfg_con_ed_fgpts[index_l] =
fg_norm*d_mag_norm_dot_jac_inv_ed_fgpts[index2_l]; index_l += n_ed_fgpts_per_cell;
// Store transformed normal flux for right flux point
d_norm_tfg_con_ed_fgpts[index_r] =
-fg_norm*d_mag_norm_dot_jac_inv_ed_fgpts[index2_r]; index_r += n_ed_fgpts_per_cell;
// ------------------------
// j=3
// ------------------------
fl_temp = norm_x*(vx_l*(energy_l + p_l)) + norm_y*(vy_l*(energy_l + p_l));
fr_temp = norm_x*(vx_r*(energy_r + p_r)) + norm_y*(vy_r*(energy_r + p_r));
ql_temp = energy_l;
qr_temp = energy_r;
// Computing common normal flux
fg_norm = 0.5*((fl_temp+fr_temp)
-(vn_av_mag+c_av)*(qr_temp-ql_temp));
// Store transformed normal flux for left flux point
d_norm_tfg_con_ed_fgpts[index_l] =
fg_norm*d_mag_norm_dot_jac_inv_ed_fgpts[index2_l];
// Store transformed normal flux for right flux point
d_norm_tfg_con_ed_fgpts[index_r] =
-fg_norm*d_mag_norm_dot_jac_inv_ed_fgpts[index2_r];
Am I missing something really obvious? Also, the kernel that works uses 68 registers, while the version that fails on 3.1 uses only 54 registers. I would really appreciate your help.
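In case it helps anyone reproduce this, here is a minimal host-side sketch of how the two variants can be checked against each other on the CPU. It is not the actual kernel. The `EdgeState` struct and the functions `common_flux`, `flux_loop`, `flux_unrolled` and `variants_agree` are hypothetical names introduced only for this illustration. It assumes double precision (the kernel's real types are not shown above) and leaves out the scaling by `d_mag_norm_dot_jac_inv_ed_fgpts` and the strided stores, so it only compares the four `fg_norm` values.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical bundle of the per-edge inputs; field names mirror the kernel's locals.
struct EdgeState {
    double rho_l, vx_l, vy_l, p_l, energy_l;   // left state
    double rho_r, vx_r, vy_r, p_r, energy_r;   // right state
    double norm_x, norm_y;                     // edge normal
    double vn_av_mag, c_av;                    // averaged normal speed and sound speed
};

// Common normal flux for one equation, same expression as in both variants.
inline double common_flux(double fl, double fr, double ql, double qr,
                          const EdgeState& s) {
    return 0.5 * ((fl + fr) - (s.vn_av_mag + s.c_av) * (qr - ql));
}

// Variant 1: the for-loop with the if/else chain on j.
void flux_loop(const EdgeState& s, double out[4]) {
    assert(out != nullptr);
    for (int j = 0; j < 4; j++) {
        double fl, fr, ql, qr;
        if (j == 0) {
            fl = s.norm_x*(s.rho_l*s.vx_l) + s.norm_y*(s.rho_l*s.vy_l);
            fr = s.norm_x*(s.rho_r*s.vx_r) + s.norm_y*(s.rho_r*s.vy_r);
            ql = s.rho_l;
            qr = s.rho_r;
        } else if (j == 1) {
            fl = s.norm_x*(s.p_l + s.rho_l*s.vx_l*s.vx_l) + s.norm_y*(s.rho_l*s.vy_l*s.vx_l);
            fr = s.norm_x*(s.p_r + s.rho_r*s.vx_r*s.vx_r) + s.norm_y*(s.rho_r*s.vy_r*s.vx_r);
            ql = s.rho_l*s.vx_l;
            qr = s.rho_r*s.vx_r;
        } else if (j == 2) {
            fl = s.norm_x*(s.rho_l*s.vx_l*s.vy_l) + s.norm_y*(s.p_l + s.rho_l*s.vy_l*s.vy_l);
            fr = s.norm_x*(s.rho_r*s.vx_r*s.vy_r) + s.norm_y*(s.p_r + s.rho_r*s.vy_r*s.vy_r);
            ql = s.rho_l*s.vy_l;
            qr = s.rho_r*s.vy_r;
        } else {
            fl = s.norm_x*(s.vx_l*(s.energy_l + s.p_l)) + s.norm_y*(s.vy_l*(s.energy_l + s.p_l));
            fr = s.norm_x*(s.vx_r*(s.energy_r + s.p_r)) + s.norm_y*(s.vy_r*(s.energy_r + s.p_r));
            ql = s.energy_l;
            qr = s.energy_r;
        }
        out[j] = common_flux(fl, fr, ql, qr, s);
    }
}

// Variant 2: the same four cases written out by hand.
void flux_unrolled(const EdgeState& s, double out[4]) {
    assert(out != nullptr);
    out[0] = common_flux(
        s.norm_x*(s.rho_l*s.vx_l) + s.norm_y*(s.rho_l*s.vy_l),
        s.norm_x*(s.rho_r*s.vx_r) + s.norm_y*(s.rho_r*s.vy_r),
        s.rho_l, s.rho_r, s);
    out[1] = common_flux(
        s.norm_x*(s.p_l + s.rho_l*s.vx_l*s.vx_l) + s.norm_y*(s.rho_l*s.vy_l*s.vx_l),
        s.norm_x*(s.p_r + s.rho_r*s.vx_r*s.vx_r) + s.norm_y*(s.rho_r*s.vy_r*s.vx_r),
        s.rho_l*s.vx_l, s.rho_r*s.vx_r, s);
    out[2] = common_flux(
        s.norm_x*(s.rho_l*s.vx_l*s.vy_l) + s.norm_y*(s.p_l + s.rho_l*s.vy_l*s.vy_l),
        s.norm_x*(s.rho_r*s.vx_r*s.vy_r) + s.norm_y*(s.p_r + s.rho_r*s.vy_r*s.vy_r),
        s.rho_l*s.vy_l, s.rho_r*s.vy_r, s);
    out[3] = common_flux(
        s.norm_x*(s.vx_l*(s.energy_l + s.p_l)) + s.norm_y*(s.vy_l*(s.energy_l + s.p_l)),
        s.norm_x*(s.vx_r*(s.energy_r + s.p_r)) + s.norm_y*(s.vy_r*(s.energy_r + s.p_r)),
        s.energy_l, s.energy_r, s);
}

// True if both variants give the same four fluxes within a relative tolerance.
bool variants_agree(const EdgeState& s, double rel_tol = 1e-12) {
    double a[4], b[4];
    flux_loop(s, a);
    flux_unrolled(s, b);
    for (int j = 0; j < 4; j++) {
        double scale = std::fmax(1.0, std::fmax(std::fabs(a[j]), std::fabs(b[j])));
        if (std::fabs(a[j] - b[j]) > rel_tol * scale) return false;
    }
    return true;
}
```

Calling `variants_agree` on a few representative edge states should return true if the two formulations really are arithmetically equivalent. That would suggest the difference I see on the GPU comes from how the 3.1 compiler handles the loop rather than from the source itself.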