Is there any way to find the last thread block to finish kernel execution?

Hi everyone,

I checked the forum but I haven’t seen this question.

I wonder whether there is any way to find the last thread block to finish kernel execution. I'd like to use that last block to do some extra work before leaving the kernel.

Thank you in advance

Use the method covered in the threadFenceReduction CUDA sample code (threadblock draining):

http://docs.nvidia.com/cuda/cuda-samples/index.html#threadfencereduction

In a nutshell, you initialize a global int value to zero. When a threadblock has finished its work, one thread from that block increments the value using atomicAdd. atomicAdd returns the value from before the increment, so the block that gets back (number of blocks - 1) knows that every other block has already done its increment. That block is the last block, and it runs your last-block code.
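Here is a minimal sketch of that pattern, assuming a 1D grid. It is not the sample's code: the names blocksDone, sumWithLastBlock, partial and total are made up for illustration, and the per-block work is a deliberately naive sum. Note that "last" here means the last block to reach the atomicAdd, which is what you need if the extra work depends on every other block's results being written.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical counter name; lives in global memory and starts at zero.
__device__ unsigned int blocksDone = 0;

__global__ void sumWithLastBlock(const float *in, volatile float *partial,
                                 float *total, int n)
{
    __shared__ bool isLastBlock;

    if (threadIdx.x == 0) {
        // Per-block work (kept naive on purpose): sum this block's slice.
        int begin = blockIdx.x * blockDim.x;
        int end = min(begin + (int)blockDim.x, n);
        float s = 0.0f;
        for (int i = begin; i < end; ++i) s += in[i];
        partial[blockIdx.x] = s;

        // Make the partial result visible to other blocks before signalling.
        __threadfence();

        // atomicAdd returns the old value, so the last block to arrive
        // gets back gridDim.x - 1.
        unsigned int ticket = atomicAdd(&blocksDone, 1u);
        isLastBlock = (ticket == gridDim.x - 1);
    }
    __syncthreads();  // broadcast isLastBlock to the whole block

    // Last-block code: every other block has already published its result.
    if (isLastBlock && threadIdx.x == 0) {
        float s = 0.0f;
        for (unsigned int i = 0; i < gridDim.x; ++i) s += partial[i];
        *total = s;
        blocksDone = 0;  // reset so the kernel can be launched again
    }
}

int main()
{
    const int n = 1000, threads = 128;
    const int blocks = (n + threads - 1) / threads;

    std::vector<float> h_in(n, 1.0f);
    float *d_in, *d_partial, *d_total;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_total, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    sumWithLastBlock<<<blocks, threads>>>(d_in, d_partial, d_total, n);

    float h_total = 0.0f;
    cudaMemcpy(&h_total, d_total, sizeof(float), cudaMemcpyDeviceToHost);
    printf("total = %.1f (expected %d)\n", h_total, n);

    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_total);
    return h_total == (float)n ? 0 : 1;
}
```

The __threadfence() before the atomicAdd matters: without it the last block could see the counter reach its final value before another block's partial result is visible. For a 2D or 3D grid you would compare against the total block count minus one rather than gridDim.x - 1.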