Block execution: are blocks executed concurrently?

Hi there

I have been trying to resolve this by reading books and articles, but it is still not quite clear to me.

What exactly is executed in parallel in CUDA? I understand that the unit of execution is a warp of 32 threads, and I believe that on Fermi up to 48 warps can be resident on a multiprocessor at once.

So a block issues its warps one after another (warp 1, warp 2, … warp n), but asynchronously, i.e. not necessarily in the order 1, 2, … n?

What about the blocks? Are they executed in parallel or in an asynchronous sequence?

Please help. The reason I am asking is that I need to update an array where different blocks might be accessing the same address.

Best,
Than

I think blocks get launched in order, but they don’t necessarily execute or finish in order. There’s no way to synchronize data between blocks within a kernel except through atomic operations. Generally, you have to finish all threads of all blocks, write the results to global memory, and then start a new kernel call to use the new data (unless you can do it via atomics).
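For the case in the question (several blocks updating the same array element), here is a minimal sketch of the atomics route. The names scatterAdd and countWithAtomics are made up for illustration, the "update" is assumed to be a simple increment, and error checking is left out to keep it short, so treat it as a starting point rather than a finished implementation.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread increments the bin selected by its input index.
// Threads from different blocks may target the same bin, so the
// read-modify-write is done with atomicAdd instead of a plain "+= 1".
__global__ void scatterAdd(const int* idx, int n, int* bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(&bins[idx[i]], 1);
    }
}

// Hypothetical host helper (not from the thread): counts how often each
// bin index occurs in idx. Assumes idx is non-empty and every value is
// in [0, numBins). No error checking, for brevity.
std::vector<int> countWithAtomics(const std::vector<int>& idx, int numBins)
{
    int n = static_cast<int>(idx.size());
    int* dIdx = nullptr;
    int* dBins = nullptr;
    cudaMalloc(&dIdx, n * sizeof(int));
    cudaMalloc(&dBins, numBins * sizeof(int));
    cudaMemcpy(dIdx, idx.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dBins, 0, numBins * sizeof(int));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scatterAdd<<<blocks, threads>>>(dIdx, n, dBins);

    // This blocking copy on the default stream waits for the kernel to finish.
    std::vector<int> bins(numBins);
    cudaMemcpy(bins.data(), dBins, numBins * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dIdx);
    cudaFree(dBins);
    return bins;
}

int main()
{
    // 10000 elements spread over 4 bins, so many blocks hit every bin.
    std::vector<int> idx(10000);
    for (int i = 0; i < 10000; ++i) idx[i] = i % 4;

    std::vector<int> bins = countWithAtomics(idx, 4);
    for (int b = 0; b < 4; ++b) printf("bin %d: %d\n", b, bins[b]);
    return 0;
}
```

If the update cannot be expressed as an atomic operation, the other route mentioned above applies: let the first kernel write its results to global memory and launch a second kernel to consume them.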

You can make no assumptions about the order of block execution. The hardware may execute them in any order.

This is a fundamental truth: there are any number of reasons why the ordering of blocks may not match your expectations.
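If you want to see this for yourself, one rough way is to have each block take a "ticket" from a global counter and record its block index in that slot. The sketch below does that; recordBlockOrder and observeBlockOrder are names invented here, and the recorded order only reflects when each block reached the atomic on that particular run, so it may differ between runs and between GPUs.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Thread 0 of every block takes the next ticket from a global counter
// and stores its block index in the slot with that ticket number.
__global__ void recordBlockOrder(int* counter, int* order)
{
    if (threadIdx.x == 0) {
        int ticket = atomicAdd(counter, 1);
        order[ticket] = blockIdx.x;
    }
}

// Hypothetical host helper: returns the block indices in the order in
// which the blocks reached the atomic. No error checking, for brevity.
std::vector<int> observeBlockOrder(int numBlocks, int threadsPerBlock)
{
    int* dCounter = nullptr;
    int* dOrder = nullptr;
    cudaMalloc(&dCounter, sizeof(int));
    cudaMalloc(&dOrder, numBlocks * sizeof(int));
    cudaMemset(dCounter, 0, sizeof(int));

    recordBlockOrder<<<numBlocks, threadsPerBlock>>>(dCounter, dOrder);

    // Blocking copy on the default stream: waits for the kernel first.
    std::vector<int> order(numBlocks);
    cudaMemcpy(order.data(), dOrder, numBlocks * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dCounter);
    cudaFree(dOrder);
    return order;
}

int main()
{
    std::vector<int> order = observeBlockOrder(1024, 64);
    // Print the first few slots; the sequence is whatever the hardware did.
    for (int i = 0; i < 16; ++i) printf("slot %d: block %d\n", i, order[i]);
    return 0;
}
```

Every block index shows up exactly once, but nothing guarantees that slot i holds block i, which is why code should never depend on it.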

Blocks run in parallel up to the number that can be resident at once (this depends on the GPU model and on how you code your solution, for example the block size and the registers and shared memory each block uses). After that, the remaining blocks have to wait until one of the running blocks finishes.

So if your application has 10000 blocks and your GPU can run, say, 36 of those at once, then 36 will be launched and 9964 will wait. When some of the first 36 finish, a similar number of the waiting ones will start to replace them. The original 36 will probably not finish all at the same time or in order. Hence “make no assumptions about the order of block execution”.
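To get a number like the "36" above for your own GPU and kernel, the runtime's occupancy API can give an estimate. This is only a sketch: scaleKernel is a placeholder kernel and maxResidentBlocks is a made-up helper, it assumes a toolkit recent enough to provide cudaOccupancyMaxActiveBlocksPerMultiprocessor, and it skips error checking. The result depends on the kernel's real register and shared memory use, so it will change with your code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; substitute your own, since the occupancy figure
// depends on the resources the actual kernel needs.
__global__ void scaleKernel(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: estimates how many blocks of scaleKernel can be
// resident on the current device at once for the given block size.
int maxResidentBlocks(int blockSize)
{
    int device = 0;
    cudaGetDevice(&device);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    int blocksPerSM = 0;
    // Last argument: bytes of dynamic shared memory per block (none here).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, scaleKernel,
                                                  blockSize, 0);
    return blocksPerSM * prop.multiProcessorCount;
}

int main()
{
    int blockSize = 256;
    int totalBlocks = 10000;  // same grid size as in the example above

    int resident = maxResidentBlocks(blockSize);
    int waiting = totalBlocks > resident ? totalBlocks - resident : 0;

    printf("blocks resident at once: %d\n", resident);
    printf("blocks waiting at launch: %d\n", waiting);
    return 0;
}
```

Even with this number in hand, which blocks make up the first batch and which ones replace them is still up to the hardware.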