I have code that works when the block size is greater than or equal to the number of threads I want to run, i.e. if I only run one block.
Unfortunately (for me :( ), if I change the block size so that more than one block runs, my code doesn’t converge on an answer.
I’m sure this is because of a race condition, but I can’t spot what I’m doing wrong.
I have the following…
I’m finding that I get the correct results with one block, but more than one block gives errors.
I thought I’d avoided the race condition caused by multiple threads with id1 and id2 all mapping back to x1[i] by using #pragma unroll.
There is no defined scheduling between blocks, so the data you are reading after the __syncthreads() has probably not even been written yet by the other blocks. __syncthreads() only synchronises threads within your current block. Even within a block, I would not rely on global memory reads after writes between threads.
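A minimal sketch of the failure mode being described (the kernel, the update function, and the neighbour access are all hypothetical stand-ins, not the poster’s actual code):

```cpp
#include <cuda_runtime.h>

__device__ float some_update(float v) { return 0.5f * v; }  // stand-in for the real update

// __syncthreads() is a barrier for the threads of ONE block only.
// Blocks may run in any order, or concurrently, on different SMs.
__global__ void broken(float *x1, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    x1[i] = some_update(x1[i]);   // write my own element

    __syncthreads();              // waits only for threads in *this* block

    // UNSAFE: this element may be owned by another block, which may
    // not have executed its write yet -- a race between blocks.
    if (i + blockDim.x < n)
        x1[i] += x1[i + blockDim.x];
}
```

With one block the barrier covers every thread, so the code appears correct; with multiple blocks the cross-block read is unordered, which matches the symptoms above.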
I didn’t realise that __syncthreads() only applied to a block.
I’m not sure that explains what I’m seeing though.
I thought that a thread would be launched and work on one data element (say, in my case, x1[5]).
If that is the case, then surely it doesn’t matter in what order they are executed, only that all threads in a block are finished before I try to operate on them all?
What am I missing??? :huh:
If threads in block 1 depend on the outcome of threads in block 0, then you need to make sure that block 0 has finished. The only way to do that is a new kernel call.
I don’t really grok your code well enough to suggest some other way to do it (if one exists).
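In other words, the kernel boundary itself acts as the grid-wide barrier. A hedged sketch of the host-side pattern (function names, block size, and iteration count are all made up for illustration):

```cpp
#include <cuda_runtime.h>

__global__ void step(float *x1, int n);  // one update pass over the data (hypothetical)

// Kernels issued to the same stream run in issue order, so each launch
// sees all global-memory writes of the previous launch: the relaunch
// is effectively a barrier across ALL blocks.
void iterate(float *d_x1, int n, int iterations)
{
    int block = 256;
    int grid  = (n + block - 1) / block;

    for (int it = 0; it < iterations; ++it)
        step<<<grid, block>>>(d_x1, n);   // implicit grid-wide barrier between launches

    cudaDeviceSynchronize();  // only needed before the host reads the results
}
```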
This is what I thought you meant. So that’s what I’ve done. :)
Do I need to use streams to ensure that the kernel for, say, id1=1, id2=1 has finished before my for loop launches the kernel for id1=1, id2=2?
The programming guide seems to say kernel calls are asynchronous, so they return before the device has completed its threads, but further on (in the streams section) it says kernel calls are assigned to the default stream 0??? :unsure:
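For what it’s worth, those two statements aren’t contradictory: launches are asynchronous with respect to the *host*, but kernels issued to the same stream (including the default stream 0) execute on the *device* in issue order. So a plain loop of launches is already serialised without any explicit stream handling. A sketch under those assumptions (kernel name and parameters hypothetical):

```cpp
#include <cuda_runtime.h>

__global__ void my_kernel(float *x1, int id1, int id2);  // hypothetical kernel

void run(float *d_x1, int id1, int max_id2, int grid, int block)
{
    for (int id2 = 1; id2 <= max_id2; ++id2) {
        // Each launch returns to the host immediately (asynchronous),
        // but the device will not start this kernel until the previous
        // launch in stream 0 has completed.
        my_kernel<<<grid, block>>>(d_x1, id1, id2);
    }
    cudaDeviceSynchronize();  // block the host only when you need the results back
}
```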