Threads and Blocks execution order

I have this code:

dim3 GridDimension = dim3(8);
dim3 BlockDimension = dim3(128);

KernelDoSomething<<<GridDimension, BlockDimension>>>(buffer);

‘KernelDoSomething’ both READS and WRITES buffer. The kernel is organized so that the threads of the block with blockIdx.x == 0 WRITE the first 32 floats to buffer; the threads of the block with blockIdx.x == 1 READ those first 32 floats, calculate something, then WRITE the next 32 floats; and so on.

My question is: is it safe to do so? Do I have any guarantee that all threads from blockIdx.x == 0 have finished BEFORE any thread from blockIdx.x == 1 starts?

Currently on my 8600 mobile everything works fine, but I don’t know whether that holds for an 8800GT, a Tesla, or other cards.

No, it’s not safe. I guess that when you switch to a card with more than 128 SPs you’ll run into problems.

You should not make any assumption about block ordering.

Actually, blocks 0 and 1 will most of the time be executed in parallel, since each block is executed on one multiprocessor (MP) and your GPU has several of those.
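
If each chunk really depends on the previous one, one way to get a guaranteed order is to launch one kernel per stage instead of one block per stage: kernels launched into the same stream run one after another. Below is only a minimal sketch of that idea, not your actual kernel. KernelStage, RunStages, CHUNK and the "+ 1.0f" calculation are made-up placeholders, and I use 32 threads per launch (one per float) rather than your 128.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define CHUNK 32  // floats produced per stage, as in the question

// Hypothetical stage kernel: stage 0 seeds the first chunk, every later
// stage reads the previous chunk and writes its own chunk.
// The "+ 1.0f" stands in for the real calculation.
__global__ void KernelStage(float *buffer, int stage)
{
    int i = threadIdx.x;
    if (i >= CHUNK) return;

    if (stage == 0)
        buffer[i] = (float)i;
    else
        buffer[stage * CHUNK + i] = buffer[(stage - 1) * CHUNK + i] + 1.0f;
}

// Runs numStages stages in order and copies the whole buffer to hostOut.
// Returns true if no CUDA error was reported.
bool RunStages(float *hostOut, int numStages)
{
    float *buffer = NULL;
    size_t bytes = (size_t)numStages * CHUNK * sizeof(float);
    if (cudaMalloc((void **)&buffer, bytes) != cudaSuccess)
        return false;

    // All launches go to the default stream, so stage s only starts
    // after stage s - 1 has completely finished.
    for (int stage = 0; stage < numStages; ++stage)
        KernelStage<<<1, CHUNK>>>(buffer, stage);

    // cudaMemcpy waits for the preceding kernels before copying.
    cudaError_t err = cudaMemcpy(hostOut, buffer, bytes, cudaMemcpyDeviceToHost);
    cudaFree(buffer);
    return err == cudaSuccess;
}

int main()
{
    float out[8 * CHUNK];
    if (!RunStages(out, 8)) {
        printf("CUDA error\n");
        return 1;
    }
    printf("last chunk, element 5: %f\n", out[7 * CHUNK + 5]);
    return 0;
}
```

The trade-off is that each launch here uses a single small block, so most of the GPU sits idle. This only pays off if the stages truly have to run one after another; if the chunks can be made independent, a single launch with 8 blocks is the better fit.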