Let’s assume I’m running my kernel on a grid like this: <N, 1, 1>. Each block has 2^m threads (let’s say it’s equal to the warp size). Is there any guarantee that the threads running simultaneously are sequential threads, i.e. that their “global ID” increases with threadIdx.x? I’m considering using shared memory and would like neighbouring threads to load a chunk of data that their neighbours can then use as well. But I want to be sure that the block of 2^m threads running in parallel consists of sequential threads.
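For reference, here is a sketch of the pattern I have in mind (names like `TILE`, `in`, `out` are just illustrative). It relies on exactly the assumption I’m asking about: that the whole warp-sized block executes in lockstep, so no barrier is needed between the write and the read:

```cuda
// Sketch only; assumes one warp-sized block runs in lockstep,
// which is exactly the guarantee being asked about.
#define TILE 32  // 2^m, equal to the warp size here

__global__ void neighbourLoad(const float *in, float *out)
{
    __shared__ float tile[TILE];

    // The "global ID" -- increases with threadIdx.x within a block.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[gid];  // each thread loads one element

    // Read a value loaded by a neighbouring thread. Only safe without
    // a barrier if all threads of the block execute in lockstep.
    out[gid] = tile[(threadIdx.x + 1) % TILE];
}
```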
Am I correct that you are asking whether all threads in the same warp will have sequential thread ids?
If so, the answer is yes on current GPUs. However, the programming guide and PTX specification do not guarantee this, and in general there are optimizations that become possible if threads within a block can be assigned to warps dynamically, so there is nothing stopping NVIDIA from changing this behavior in the future. As a future-proof solution, you could do a __syncthreads() after each update to shared memory, but that would incur some performance overhead.
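The future-proof version of the pattern you describe would look something like this (a sketch, with illustrative names): the __syncthreads() between the shared-memory write and the read makes the kernel correct even if the threads of the block were not scheduled into a single lockstep warp:

```cuda
// Future-proof sketch: the barrier makes the read of a neighbour's
// shared-memory write correct regardless of how threads are assigned
// to warps. Names are illustrative.
__global__ void neighbourLoadSafe(const float *in, float *out)
{
    __shared__ float tile[32];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[gid];

    __syncthreads();  // all shared-memory writes now visible block-wide

    out[gid] = tile[(threadIdx.x + 1) % 32];
}
```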
In the future I wish that NVIDIA would introduce a “local” barrier instruction that blocks only a programmer-specified subset of threads within a block. For example, __localBarrier(0, 31) would block only threads 0 through 31. It would allow programmers to specify exactly the behavior that you describe in a safe way, the compiler/hardware could reorder threads as long as the semantics of the barrier were maintained, and the instruction could be optimized out by the JIT if the threads being synced belonged to the same warp.
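Such a barrier might be used like this (purely hypothetical sketch: __localBarrier is the instruction proposed above, not an existing CUDA intrinsic):

```cuda
// Hypothetical sketch -- __localBarrier is NOT real CUDA, it is the
// proposed instruction. Other names are illustrative.
__global__ void neighbourLoadLocal(const float *in, float *out)
{
    __shared__ float tile[32];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[gid];

    // Synchronize only threads 0..31; if they happen to map to a
    // single warp, the JIT could remove the barrier entirely.
    __localBarrier(0, 31);

    out[gid] = tile[(threadIdx.x + 1) % 32];
}
```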