Possible race condition?

Hi all,

I’m getting inconsistent results across multiple runs of the same CUDA code. I think I may have a race condition in the code I’m posting below, but I’m not sure, as I’m a CUDA noob.

[codebox]

int tx = threadIdx.x;

__shared__ float pML_local[256];

__shared__ float pML_sum[64];

pML_local[tx] = exp(-exponent); /* assume exponent is already calculated correctly */

__syncthreads();

    if (tx < 32) {

        pML_sum[tx]    = pML_local[tx]    + pML_local[tx+64] + pML_local[tx+128] + pML_local[tx+192];

        pML_sum[tx+32] = pML_local[tx+32] + pML_local[tx+96] + pML_local[tx+160] + pML_local[tx+224];

		if (tx < 16) {

			pML_sum[tx] += pML_sum[tx+16] + pML_sum[tx+32] + pML_sum[tx+48];

		    if (tx < 4) {

				pML_sum[tx] += pML_sum[tx+4] + pML_sum[tx+8] + pML_sum[tx+12];

				if (tx < 1) {

					pML_sum[0] += pML_sum[1] + pML_sum[2] + pML_sum[3];

				}

			}

		}

	}

/* Use the sum to normalize the values. Also, subtract off appropriate amount to ensure final pML_sum = MLthreshold */

    __syncthreads();[/codebox]

I’m thinking I may need a __syncthreads() inside each of the if statements?

Thank you in advance!

You’re right that a lack of __syncthreads() could cause problems, but you probably also know you’re relying on the efficient implicit synchronization you get by working at the warp level of just 32 threads.

Just as a first guess, you may need to declare the pML_sum array as volatile to prevent the compiler from making too many assumptions.

Obligatory microoptimizations:

It won’t affect your correctness, but you’ve unrolled your reduction to add three values at once instead of one. For a general reduction this can be a win, since the extra adds are usually cheaper than a __syncthreads(). But with a warp-level reduction you’re better off doing the simpler power-of-2 unrolled version and saving the extra math ops and shared memory reads.

Finally, it’s likely you only care about pML_sum[0]… it looks like a classic reduction.
If you don’t care about garbage values in pML_sum[1…63], then you can simply remove all of the inner if-tests and let the whole warp run every step. You still need the first tx < 32 test, though.
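For concreteness, here is a minimal sketch of what that might look like (untested; the kernel wrapper and the exponents/blockSums arguments are just placeholders so it stands alone, and the unsynchronized steps rely on the volatile declaration plus the implicit warp-level synchronization mentioned above):

[codebox]__global__ void reduce_pML_sketch(const float *exponents, float *blockSums)
{
    __shared__ float          pML_local[256];   /* one value per thread; 256-thread block assumed */
    __shared__ volatile float pML_sum[64];      /* volatile: each warp-level store is re-read from shared memory */

    int tx = threadIdx.x;
    pML_local[tx] = expf(-exponents[blockIdx.x * 256 + tx]);   /* stand-in for your exponent computation */
    __syncthreads();

    if (tx < 32) {
        /* fold 256 values into 64 partial sums, all written by a single warp */
        pML_sum[tx]      = pML_local[tx]       + pML_local[tx + 64]
                         + pML_local[tx + 128] + pML_local[tx + 192];
        pML_sum[tx + 32] = pML_local[tx + 32]  + pML_local[tx + 96]
                         + pML_local[tx + 160] + pML_local[tx + 224];

        /* power-of-2 warp reduction: no inner if-tests, no __syncthreads() */
        pML_sum[tx] += pML_sum[tx + 32];
        pML_sum[tx] += pML_sum[tx + 16];
        pML_sum[tx] += pML_sum[tx + 8];
        pML_sum[tx] += pML_sum[tx + 4];
        pML_sum[tx] += pML_sum[tx + 2];
        pML_sum[tx] += pML_sum[tx + 1];
    }
    __syncthreads();

    /* only pML_sum[0] is meaningful; pML_sum[1..63] hold partial garbage */
    if (tx == 0)
        blockSums[blockIdx.x] = pML_sum[0];
}[/codebox]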

Thanks for your prompt reply. The issue seems to be resolved, although after applying the fix you suggested (declaring pML_sum as volatile), I found that the variable exponent was giving inconsistent results as well. A strategically placed __syncthreads() seems to have fixed that, as I am now getting consistent results across runs.

[codebox]int bx = blockIdx.x;

int tx = threadIdx.x;

int j, k, l;

int idx;

__shared__ volatile float pML_sum[64];

__shared__ float pML_local[GF_ORDER];

__shared__ int iLvIdx0[2];

__shared__ float detInv_R[1];

__shared__ float sigSq11[1], sigSq22[1], sigSq12_r[1], sigSq12_i[1];

float exponent = 0.;

float ref_sym_real, ref_sym_imag;  

float xr, xi, vr, vi, beta_r[2], beta_i[2];

float* zReal0;

float* zImag0;

float* vReal0;

float* vImag0;

float param;

float beta0_sq,beta1_sq;

/* initialize pointers so that we always have a reference to the start address */

zReal0 = zReal;

zImag0 = zImag;

vReal0 = vReal;

vImag0 = vImag;

/* Only work if your index is less than the number of symbols */

if (bx < dataTonesPerFram) {

     int prod = bx/(dataTonesPerFram/nSperGF); 

   /* offset pML pointer based on which block we're in */

      pML   += bx * GF_ORDER;

/* Sum the squares of the differences of received signals and their expected values */

  //  #pragma unroll<-----can't unroll these loops because of the texture fetch

    for (k = 0; k < nSperGF; k++){

/* since we need to calculate this offset only once per block,store it in shared memory */

          if(tx == 0){

              iLvIdx0[0] = iLvIdx[bx*nSperGF + k] + (prod*dataTonesPerFram);

              iLvIdx0[1] = iLvIdx[bx*nSperGF + k];

              sigSq11[0]   = nsig_r[iLvIdx0[1]*nR*nR];

              sigSq22[0]   = nsig_r[iLvIdx0[1]*nR*nR+3];

              sigSq12_r[0] = nsig_r[iLvIdx0[1]*nR*nR+2];

              sigSq12_i[0] = nsig_i[iLvIdx0[1]*nR*nR+2];

detInv_R[0] = 1/(sigSq11[0]*sigSq22[0] - (sigSq12_r[0]*sigSq12_r[0] + sigSq12_i[0]*sigSq12_i[0]));

}

__syncthreads();

        /* compute offset based on interleaving index  */  

        zReal = zReal0 + iLvIdx0[0]*nR;

        zImag = zImag0 + iLvIdx0[0]*nR;

        vReal = vReal0 + iLvIdx0[0]*nR*nT;

        vImag = vImag0 + iLvIdx0[0]*nR*nT;

// #pragma unroll

        for (j = 0; j < nR; j++){

            xr = 0.; xi = 0.;

         //   #pragma unroll

            for (l = 0; l < nT; l++){

                /* get the correct refSym */

                ref_sym_real = refSyms_r[tx*nSperGF*nT + k*nT + l]; 

                ref_sym_imag = refSyms_i[tx*nSperGF*nT + k*nT + l]; 

/* Get the channel */

                idx = nR*l + j;

                vr = vReal[idx]; 

                vi = vImag[idx];

/* expected received value = channel * symbol */

                xr += vr * ref_sym_real - vi * ref_sym_imag;

                xi += vr * ref_sym_imag + vi * ref_sym_real;

            }

            /* take difference between actual and expected received value minus noise */

            beta_r[j] = zReal[j] - xr - n_r[iLvIdx0[1]*nR+j]; //tex2D(nr_tex, j, iLvIdx0[1]);

            beta_i[j] = zImag[j] - xi - n_i[iLvIdx0[1]*nR+j]; //tex2D(ni_tex, j, iLvIdx0[1]);

/* calculate the matrix product in the exponent of the prob model */

        }

        beta0_sq = beta_r[0]*beta_r[0] + beta_i[0]*beta_i[0];

        beta1_sq = beta_r[1]*beta_r[1] + beta_i[1]*beta_i[1];

        param    = (beta_r[0]*beta_r[1] + beta_i[0]*beta_i[1])*sigSq12_r[0] + (beta_i[0]*beta_r[1] - beta_r[0]*beta_i[1])*sigSq12_i[0]; 

        __syncthreads();                                                                                                  /* <-----------------------had to add this in to get consistent result in exponent */

        exponent += detInv_R[0]*(beta0_sq*sigSq22[0] + beta1_sq*sigSq11[0] - 2.*param); 

}

/* Transform into probabilities with an exponential */

    pML_local[tx] = exp(-exponent);

/* Sum the pML values */

    __syncthreads();

    if (tx < 32) {

        pML_sum[tx]    = pML_local[tx]    + pML_local[tx+64] + pML_local[tx+128] + pML_local[tx+192];

        pML_sum[tx+32] = pML_local[tx+32] + pML_local[tx+96] + pML_local[tx+160] + pML_local[tx+224];

		if (tx < 16) {

			pML_sum[tx] += pML_sum[tx+16] + pML_sum[tx+32] + pML_sum[tx+48];

		    if (tx < 4) {

				pML_sum[tx] += pML_sum[tx+4] + pML_sum[tx+8] + pML_sum[tx+12];

				if (tx < 1) {

					pML_sum[0] += pML_sum[1] + pML_sum[2] + pML_sum[3];

				}

			}

		}

	}

/* Use the sum to normalize the values. Also, subtract off appropriate amount to ensure final pML_sum = MLthreshold */

    __syncthreads();

    if (pML_sum[0] == 0){

        pML[tx] = 1. / GF_ORDER;

    } else{

      pML[tx] = MLthreshold*(pML_local[tx] / pML_sum[0]) + (1. - MLthreshold)/GF_ORDER;

    }

}[/codebox]

I don’t really understand why this __syncthreads() needs to be there, though. The variable exponent is being calculated by the same thread that calculates the quantities that make up exponent, so I don’t see why there would be a race condition. If anyone can shed any light on this issue, I would greatly appreciate it!
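In case it helps, here is the bare shape of the loop with placeholder names (sigma, work, sigShared, and the kernel itself are just stand-ins, not my real code), showing where the shared values get written by thread 0, where every thread reads them, and where the __syncthreads() I added sits:

[codebox]__global__ void exponent_loop_sketch(const float *sigma, const float *work,
                                     float *out, int nIter)
{
    __shared__ float sigShared[1];      /* stands in for sigSq11/sigSq22/sigSq12/detInv_R */
    int tx = threadIdx.x;
    float exponent = 0.f;

    for (int k = 0; k < nIter; k++) {
        if (tx == 0)
            sigShared[0] = sigma[k];    /* thread 0 alone refreshes the shared values each iteration */

        __syncthreads();                /* matches the first barrier inside the loop */

        /* per-thread work for iteration k that reads the shared values */
        float partial = sigShared[0] * work[blockIdx.x * blockDim.x + tx];

        __syncthreads();                /* the barrier I added to get consistent results */

        exponent += sigShared[0] * partial;   /* still reads the shared values after the barrier */
    }

    out[blockIdx.x * blockDim.x + tx] = expf(-exponent);
}[/codebox]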
