Race condition in for loop Help!

Hi all,

I have code that works when the block size is greater than or equal to the number of threads I want to run, i.e. if I only run one block.
Unfortunately (for me :( ), if I change the block size so that more than one block runs, my code doesn't converge on an answer.

I'm sure this is because of a race condition, but I can't spot what I'm doing wrong.
I have the following…

(kernel pseudocode…)

int i = blockIdx.x*blockDim.x + threadIdx.x;

x1[i] = constant*x1[i];

#pragma unroll
for (id1_d = 1; id1_d <= Ndim; id1_d++) {
    #pragma unroll
    for (id2_d = 1; id2_d <= Ndim; id2_d++) {
        x1[i] = x1[i] - array[i][id1_d][id2_d];
    }
}

I'm finding that I get the correct results if I have one block, but more than one block gives errors.
I thought I'd avoided the race condition caused by the id1 and id2 iterations all mapping back onto x1[i] by using #pragma unroll.

Any thoughts???

Thanks,

Wes.

P.S. Actual code…

x1_d[i] = cuCmulf((cuFloatComplex)make_cuFloatComplex(bmass2, 0.0), p_d[i]);
__syncthreads();

#pragma unroll
for (id1_d = 1; id1_d <= Ndim; id1_d++) {
    #pragma unroll
    for (id2_d = 1; id2_d <= Ndim; id2_d++) {
        x1_d[i] = cuCaddf(cuCsubf(x1_d[i],
                cuCmulf(expv1d_d[i+(kvol*(id1_d-1))],
                    cuCsubf(cuCmulf(expv1d_d[ishf1d_d[i+(kvol*(id1_d-1))]+(kvol*(id2_d-1))],
                            p_d[ishf1d_d[ishf1d_d[i+(kvol*(id1_d-1))]+(kvol*(id2_d-1))]]),
                        cuCmulf(expvConj1d_d[ishb1d_d[ishf1d_d[i+(kvol*(id1_d-1))]+(kvol*(id2_d-1))]+(kvol*(id2_d-1))],
                            p_d[ishb1d_d[ishf1d_d[i+(kvol*(id1_d-1))]+(kvol*(id2_d-1))]])))),
            cuCmulf(expvConj1d_d[ishb1d_d[i+(kvol*(id1_d-1))]+(kvol*(id1_d-1))],
                cuCsubf(cuCmulf(expv1d_d[ishb1d_d[i+(kvol*(id1_d-1))]+(kvol*(id2_d-1))],
                        p_d[ishf1d_d[ishb1d_d[i+(kvol*(id1_d-1))]+(kvol*(id2_d-1))]]),
                    cuCmulf(expvConj1d_d[ishb1d_d[ishb1d_d[i+(kvol*(id1_d-1))]+(kvol*(id2_d-1))]+(kvol*(id2_d-1))],
                        p_d[ishb1d_d[ishb1d_d[i+(kvol*(id1_d-1))]+(kvol*(id2_d-1))]]))));
        __syncthreads();
        printf("\n%d %d %d %d %d %f %f", id1_d, id2_d, i, tid, blockIdx.x, cuCrealf(x1_d[i]), cuCimagf(x1_d[i]));
    }
}

Sorry for the mess!!!

There is no defined scheduling between blocks, so the data you are reading after the __syncthreads() has probably not even been written yet by the other blocks. __syncthreads() only synchronizes threads within your current block. Even within a block I would not rely on global-memory read-after-write between threads.
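To make the scope concrete, here is a minimal sketch (made-up names, not your code) of the problem:

__global__ void step(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        x[i] = 2.0f * x[i];

    __syncthreads();              // barrier for threads in *this* block only

    if (i < n) {
        // If j belongs to a different block, that block may not have run yet,
        // so this read can see the old (un-scaled) value of x[j]: a race.
        int j = (i + blockDim.x) % n;
        x[i] -= x[j];
    }
}

The same applies to any element your shift tables point at, if that element is produced by a thread in another block.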

Thanks for the reply.

I didn't realise that __syncthreads() only applies within a block.
I'm not sure that explains what I'm seeing though???
I thought that a thread would be launched and work on one data element (say, in my case, x1[5]).
If that is the case then surely it doesn't matter in what order they are executed, only that all threads in a block are finished when I try to operate on them all???
What am I missing??? :huh:

Thanks again,

Wes.

Sorry theMarix, I now get what you are trying to tell me. :">

I can access values that belong to threads outside my block via my p_d array, and those values haven't been calculated yet because that block hasn't executed.

So is there a way around this without leaving my kernel???

Thanks again,

Wes.

Nope, you will need to perform multiple kernel calls.

Thanks Denis.

If I'm following you correctly I need to do something like

for (id1 = 1; id1 <= Ndim; id1++) {
    for (id2 = 1; id2 <= Ndim; id2++) {
        kernel<<<dimGrid,dimBlock>>>(id1, id2);
    }
}

I’m not 100% sure why though. Could you explain?

Thanks again,

Wes.

If threads in block 1 depend on the outcome of threads in block 0, then you need to make sure that block 0 has finished. The only way to do that is a new kernel call.
I don't really grok your code well enough to suggest some other way to do it (if one exists).
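Roughly, the split looks like this (a sketch with made-up names, not your actual update): each launch computes one (id1, id2) term, and since a kernel only starts after the previous launch has finished writing global memory, the reads through the shift tables are then safe.

#include <cuComplex.h>

// One (id1, id2) step per launch; the kernel boundary acts as the device-wide
// barrier that __syncthreads() cannot provide.
__global__ void update_step(cuFloatComplex *x1_d, const cuFloatComplex *p_d,
                            int id1, int id2, int kvol)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= kvol) return;

    // Placeholder for the real (id1, id2) term built from the shift tables;
    // everything it reads was written before this kernel was launched.
    cuFloatComplex term = p_d[i];

    x1_d[i] = cuCsubf(x1_d[i], term);
}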

Thanks Denis.

This is what I thought you meant. So that’s what I’ve done. :)

Do I need to use streams to ensure that the kernel for, say, id1=1, id2=1 has finished before my for loop launches the kernel for id1=1, id2=2???

The programming guide seems to say kernel calls are asynchronous, so they return before the device has completed its threads, but further on (in the streams section) it says kernel calls are assigned to the default stream 0??? :unsure:

Thanks for the help!

Wes.

You can just call your kernels. They get run one after the other, in the order they are submitted.
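So a loop like yours needs no explicit stream handling; only the host has to wait before it touches the results. A rough host-side sketch (hypothetical names, assuming the per-step kernel sketched earlier and your existing dimGrid/dimBlock/Ndim):

for (int id1 = 1; id1 <= Ndim; id1++) {
    for (int id2 = 1; id2 <= Ndim; id2++) {
        // Each launch waits (on the device) for the previous one to finish,
        // because they are all queued on the default stream.
        update_step<<<dimGrid, dimBlock>>>(x1_d, p_d, id1, id2, kvol);
    }
}

// The copy back (or an explicit cudaDeviceSynchronize()) makes the host wait
// for the whole queue before it reads the result.
cudaMemcpy(x1_h, x1_d, kvol * sizeof(cuFloatComplex), cudaMemcpyDeviceToHost);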