sequence of thread executions are there any guarantees?

Hi guys,

Let’s assume I’m running my kernel on a grid like this: <N, 1, 1>. Each block has 2^m threads (let’s say it’s equal to warp size). Is there any guarantee that the threads running simultaneously are sequential threads? I.e. calculating their “global ID” will be increasing with each threadIdx.x? I consider using shared memory and would like neighbouring threads to load a chunk of data which their neighbours will be able to use as well. But I want to be sure that the block of size 2^m running in parallel consists of sequential threads.


From the programming guide:

There’s only a ‘global ID’ for each thread within a block, not within a grid of blocks because different blocks can be executed out of order.


Am I correct that you are asking whether all threads in the same warp will have sequential thread ids?

If so, the answer is yes on current GPUs. However, the programming guide and PTX specifications do not mention this, and in general there are some optimizations that you can do if you are allowed to dynamically assign threads in the same block to different warps, so there is nothing stopping NVIDIA from changing this behavior in the future. As a future proof solution, you could do a __syncthreads() after each update to shared memory, ut that would incur some performance overhead.

In the future I wish that NVIDIA would introduce a “local” barrier instruction that would only block a subset of threads within a block that was specified by the programmer. For example __localBarrier(0, 31) would only block threads 0 through 31. It would allow programmers to specify exactly the behavior that you describe in a safe way, the compiler/hardware could reorder threads as long as the semantics of the barrier were maintained, and the instruction could be optimized out by the JIT if the threads being synced belonged to the same warp.

Actually, it has been mentioned in the programming guide (section 5.4).

My guess is that this implies sequential thread IDs, how else would you be able to know when to omit syncthreads() for better performance?