I’m getting inconsistent results across multiple runs of the same CUDA code. I’m thinking that I may have a race condition in the code that I’m posting below, but I’m somewhat unsure, as I am a noob to CUDA.
You’re right that a missing __syncthreads() could cause problems, but you’re also likely relying on the implicit synchronization you get by working at the warp level with just 32 threads.
Just as a first guess, you may need to declare the pML_sum array as volatile, to prevent the compiler from caching shared-memory values in registers and making too many assumptions.
Obligatory microoptimizations:
It won’t affect your correctness, but you’ve unrolled your reduction to add three values at once, not one. For a general reduction this can be a win, since the extra add is usually cheaper than a __syncthreads(). But with a warp-level reduction you’re better off doing the simpler power-of-two unrolled version and saving the extra math op and shared-memory read.
Finally, it’s likely you only care about pML_sum[0]… it looks like a classic reduction.
If you don’t care about garbage values in pML_sum[1…63], then you can simply remove the if (tx<16), if (tx<8), if (tx<4), if (tx<2), and if (tx<1) tests. You still need the first tx<32 test, though.
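Putting those pieces together, a power-of-two unrolled warp reduction with a volatile pointer might look like the following sketch. Only pML_sum and tx come from your code; the 64-element size, float type, and helper name are assumptions for illustration:

```cuda
// Hedged sketch: assumes pML_sum is a __shared__ float array of at
// least 64 elements, already filled by the block's threads.
__device__ void warpReduce(volatile float *pML_sum, int tx)
{
    // Within one warp of 32 threads the adds proceed in lockstep, so
    // no __syncthreads() is needed between steps -- but "volatile" is
    // required to stop the compiler from caching shared-memory values
    // in registers between the unrolled lines.
    if (tx < 32) {
        pML_sum[tx] += pML_sum[tx + 32];
        pML_sum[tx] += pML_sum[tx + 16];
        pML_sum[tx] += pML_sum[tx +  8];
        pML_sum[tx] += pML_sum[tx +  4];
        pML_sum[tx] += pML_sum[tx +  2];
        pML_sum[tx] += pML_sum[tx +  1];
        // After this, only pML_sum[0] holds the full sum; [1..63] hold
        // garbage, which is fine for a classic reduction that reads [0].
    }
}
```

Note this warp-synchronous pattern relies on lockstep execution; on newer architectures with independent thread scheduling you’d want __syncwarp() or shuffle intrinsics instead.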
Thanks for your prompt reply. This issue seems to be resolved, although after using the fix that you suggested (declaring pML_sum as volatile), I found that the variable exponent was giving inconsistent results as well. A strategically placed __syncthreads() seems to have fixed the issue, as I am now getting consistent results across runs.
/* compute offset based on interleaving index */
zReal = zReal0 + iLvIdx0[0]*nR;
zImag = zImag0 + iLvIdx0[0]*nR;
vReal = vReal0 + iLvIdx0[0]*nR*nT;
vImag = vImag0 + iLvIdx0[0]*nR*nT;

// #pragma unroll
for (j = 0; j < nR; j++) {
    xr = 0.; xi = 0.;
    // #pragma unroll
    for (l = 0; l < nT; l++) {
        /* get the correct refSym */
        ref_sym_real = refSyms_r[tx*nSperGF*nT + k*nT + l];
        ref_sym_imag = refSyms_i[tx*nSperGF*nT + k*nT + l];
        /* get the channel */
        idx = nR*l + j;
        vr = vReal[idx];
        vi = vImag[idx];
        /* expected received value = channel * symbol */
        xr += vr * ref_sym_real - vi * ref_sym_imag;
        xi += vr * ref_sym_imag + vi * ref_sym_real;
    }
    /* take difference between actual and expected received value minus noise */
I don’t really understand why this __syncthreads() needs to be there, though. The variable exponent is being calculated by the same thread that calculates the quantities that make up exponent, so I don’t see why there would be a race condition. If anyone can shed any light on this issue, I would greatly appreciate it!