Race condition in for loop Help!

Hi all,

I have code that works when the block size is greater than or equal to the number of threads I want to run, i.e. if I only run one block.
Unfortunately (for me :( ), if I change the block size so that more than one block runs, my code doesn't converge on an answer.

I'm sure this is because of a race condition, but I can't spot what I'm doing wrong.
I have the following…

(kernel pseudocode…)

int i = blockIdx.x*blockDim.x + threadIdx.x;

x1[i] = constant*x1[i];

#pragma unroll
for (id1_d = 1; id1_d <= Ndim; id1_d++) {
    #pragma unroll
    for (id2_d = 1; id2_d <= Ndim; id2_d++) {
        x1[i] = x1[i] - array[i][id1_d][id2_d];
    }
}

I'm finding that I get the correct results if I have one block, but more than one block gives errors.
I thought I'd avoided the race condition caused by the id1 and id2 iterations all mapping back onto x1[i] by using #pragma unroll.

Any thoughts???

Thanks,

Wes.

P.S. Actual code…

x1_d[i] = cuCmulf((cuFloatComplex)make_cuFloatComplex(bmass2, 0.0), p_d[i]);
__syncthreads();

#pragma unroll
for (id1_d = 1; id1_d <= Ndim; id1_d++) {
    #pragma unroll
    for (id2_d = 1; id2_d <= Ndim; id2_d++) {
        x1_d[i] = cuCaddf(cuCsubf(x1_d[i],
                cuCmulf(expv1d_d[i+(kvol*(id1_d-1))],
                    cuCsubf(cuCmulf(expv1d_d[ishf1d_d[i+(kvol*(id1_d-1))]+(kvol*(id2_d-1))],
                            p_d[ishf1d_d[ishf1d_d[i+(kvol*(id1_d-1))]+(kvol*(id2_d-1))]]),
                        cuCmulf(expvConj1d_d[ishb1d_d[ishf1d_d[i+(kvol*(id1_d-1))]+(kvol*(id2_d-1))]+(kvol*(id2_d-1))],
                            p_d[ishb1d_d[ishf1d_d[i+(kvol*(id1_d-1))]+(kvol*(id2_d-1))]])))),
            cuCmulf(expvConj1d_d[ishb1d_d[i+(kvol*(id1_d-1))]+(kvol*(id1_d-1))],
                cuCsubf(cuCmulf(expv1d_d[ishb1d_d[i+(kvol*(id1_d-1))]+(kvol*(id2_d-1))],
                        p_d[ishf1d_d[ishb1d_d[i+(kvol*(id1_d-1))]+(kvol*(id2_d-1))]]),
                    cuCmulf(expvConj1d_d[ishb1d_d[ishb1d_d[i+(kvol*(id1_d-1))]+(kvol*(id2_d-1))]+(kvol*(id2_d-1))],
                        p_d[ishb1d_d[ishb1d_d[i+(kvol*(id1_d-1))]+(kvol*(id2_d-1))]]))));
        __syncthreads();
        printf("\n%d %d %d %d %d %f %f", id1_d, id2_d, i, tid, blockIdx.x, cuCrealf(x1_d[i]), cuCimagf(x1_d[i]));
    }
}

Sorry for the mess!!!

There is no defined scheduling between blocks, so the data you are reading after the __syncthreads() has probably not even been written yet by the other blocks. __syncthreads() only synchronizes threads within your current block. Even within a block I would not rely on global-memory read-after-write between threads.
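To make the scope concrete, here is a minimal sketch (made-up names, not your code) of the problem:

__global__ void step(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        x[i] = 2.0f * x[i];

    __syncthreads();              // barrier for threads in *this* block only

    if (i < n) {
        // If j belongs to a different block, that block may not have run yet,
        // so this read can see the old (un-scaled) value of x[j]: a race.
        int j = (i + blockDim.x) % n;
        x[i] -= x[j];
    }
}

The same applies to any element your shift tables point at, if that element is produced by a thread in another block.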

Thanks for the reply.

I didn't realise that __syncthreads() only applies within a block.
I'm not sure that explains what I'm seeing though???
I thought that a thread would be launched and work on one data element (say, in my case, x1[5]).
If that is the case then surely it doesn't matter in what order they are executed, only that all threads in a block are finished when I try to operate on them all???
What am I missing??? :huh:

Thanks again,

Wes.

Sorry theMarix, I now get what you are trying to tell me. :">

I can access values that belong to threads outside my block via my p_d array, and those values haven't been calculated yet because that block hasn't executed.

So is there a way around this without leaving my kernel???

Thanks again,

Wes.

Nope, you will need to perform multiple kernel calls.

Thanks Denis.

If I'm following you correctly I need to do something like

for (id1 = 1; id1 <= Ndim; id1++) {
    for (id2 = 1; id2 <= Ndim; id2++) {
        kernel<<<dimGrid,dimBlock>>>(id1, id2);
    }
}

I’m not 100% sure why though. Could you explain?

Thanks again,

Wes.

If threads in block 1 depend on the outcome of threads in block 0, then you need to make sure that block 0 has finished. The only way to do that is a new kernel call.
I don't really grok your code well enough to suggest some other way to do it (if one exists).
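Roughly, the split looks like this (a sketch with made-up names, not your actual update): each launch computes one (id1, id2) term, and since a kernel only starts after the previous launch has finished writing global memory, the reads through the shift tables are then safe.

#include <cuComplex.h>

// One (id1, id2) step per launch; the kernel boundary acts as the device-wide
// barrier that __syncthreads() cannot provide.
__global__ void update_step(cuFloatComplex *x1_d, const cuFloatComplex *p_d,
                            int id1, int id2, int kvol)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= kvol) return;

    // Placeholder for the real (id1, id2) term built from the shift tables;
    // everything it reads was written before this kernel was launched.
    cuFloatComplex term = p_d[i];

    x1_d[i] = cuCsubf(x1_d[i], term);
}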

Thanks Denis.

This is what I thought you meant. So that’s what I’ve done. :)

Do I need to use streams to ensure that the kernel for, say, id1=1, id2=1 has finished before my for loop launches the kernel for id1=1, id2=2???

The programming guide seems to say kernel calls are asynchronous, so they return before the device has completed its threads, but further on (in the streams section) it says kernel calls are assigned to the default stream 0??? :unsure:

Thanks for the help!

Wes.

You can just call your kernels. They get run one after the other, in the order they are submitted.
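So a loop like yours needs no explicit stream handling; only the host has to wait before it touches the results. A rough host-side sketch (hypothetical names, assuming the per-step kernel sketched earlier and your existing dimGrid/dimBlock/Ndim):

for (int id1 = 1; id1 <= Ndim; id1++) {
    for (int id2 = 1; id2 <= Ndim; id2++) {
        // Each launch waits (on the device) for the previous one to finish,
        // because they are all queued on the default stream.
        update_step<<<dimGrid, dimBlock>>>(x1_d, p_d, id1, id2, kvol);
    }
}

// The copy back (or an explicit cudaDeviceSynchronize()) makes the host wait
// for the whole queue before it reads the result.
cudaMemcpy(x1_h, x1_d, kvol * sizeof(cuFloatComplex), cudaMemcpyDeviceToHost);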