Hi,
I have this code:
dim3 GridDimension = dim3(8);
dim3 BlockDimension = dim3(128);
KernelDoSomething<<<GridDimension, BlockDimension>>>(buffer);
KernelDoSomething both READS and WRITES buffer. The kernel is organized so that the threads of blockIdx.x == 0 WRITE the first 32 floats to buffer; the threads of blockIdx.x == 1 READ those first 32 floats, calculate something, then WRITE the next 32 floats; and so on for the remaining blocks.
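To make the access pattern concrete, here is a minimal sketch of a kernel with that shape. It is not the actual kernel body: the assumption that only the first 32 threads of each block touch buffer, the initial values, and the "+ 1.0f" stand-in for "calculates something" are all made up for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch of the pattern described above, not the real kernel.
// Assumption: only the first 32 threads of each 128-thread block touch buffer.
__global__ void KernelDoSomething(float *buffer)
{
    const int chunk = 32;                 // floats written per block
    if (threadIdx.x >= chunk) return;     // remaining threads are idle in this sketch

    const int out = blockIdx.x * chunk + threadIdx.x;

    if (blockIdx.x == 0) {
        // Block 0 only writes: it produces the first 32 floats.
        buffer[out] = (float)threadIdx.x; // placeholder initial value
    } else {
        // Block b reads the 32 floats that block b-1 is expected to have written.
        // Nothing in this code enforces that block b-1 has already run.
        const float in = buffer[out - chunk];
        buffer[out] = in + 1.0f;          // placeholder for "calculates something"
    }
}
```

The read of buffer[out - chunk] is the line the question is about: it is only correct if the previous block has already finished writing.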
My question: is it safe to do this? Do I have any guarantee that all threads of blockIdx.x == 0 have finished BEFORE any thread of blockIdx.x == 1 starts?
Currently everything works fine on my 8600 mobile, but I don't know whether that also holds for an 8800GT, a Tesla, or other cards.
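If it turns out there is no such ordering guarantee between blocks, one fallback would be to launch one kernel per stage, since kernels issued to the same stream run one after another. The sketch below assumes the same made-up computation as a placeholder; StageKernel and RunStages are hypothetical names, and the per-launch overhead may or may not be acceptable.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical per-stage kernel: stage s writes floats [32*s, 32*s + 32).
__global__ void StageKernel(float *buffer, int stage)
{
    const int chunk = 32;
    const int out = stage * chunk + threadIdx.x;
    if (stage == 0)
        buffer[out] = (float)threadIdx.x;          // placeholder initial value
    else
        buffer[out] = buffer[out - chunk] + 1.0f;  // placeholder computation
}

// Runs 8 stages back to back and copies all 256 floats to host.
// Returns false if any CUDA call reports an error.
bool RunStages(float *host)
{
    const int stages = 8, chunk = 32;
    const size_t bytes = stages * chunk * sizeof(float);

    float *buffer = nullptr;
    if (cudaMalloc(&buffer, bytes) != cudaSuccess) return false;

    // Launches go to the default stream, so stage s+1 starts only
    // after stage s has completed.
    for (int s = 0; s < stages; ++s)
        StageKernel<<<1, chunk>>>(buffer, s);

    // cudaMemcpy waits for the preceding kernels before copying.
    const bool ok =
        cudaMemcpy(host, buffer, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(buffer);
    return ok;
}

int main()
{
    float host[8 * 32];
    if (!RunStages(host)) {
        printf("CUDA error\n");
        return 1;
    }
    printf("last value: %f\n", host[8 * 32 - 1]);
    return 0;
}
```

This trades the single launch for eight small ones, so it only makes sense if correctness on other cards matters more than launch overhead.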