I have been working on CUDA for some time and have started to learn some optimization techniques. Here are a few doubts:
(1) Is a warp strictly selected as a set of 32 consecutive threads, like (t0,t1,…,t31), (t32,t33,…,t63), …?
(2) Are the threads for a warp selected in such a way that you get the most coalesced global memory reads? Or so that you get the fewest shared memory bank conflicts? Which gets higher priority?
(3) If (1) is true and you have nested branches in the kernel, isn’t that a huge performance hit?
I have also heard about warp-based programming. How exactly does it differ from thread-based programming, and what are its advantages? Can someone pass along a good link about it?
Yes. The mapping of threadIdx.x to warps is explicitly documented in the CUDA C programming guide.
Neither. Threads are mapped to warps via a static assignment. It is up to you, the kernel programmer, to take advantage of that assignment to coalesce your reads and avoid bank conflicts. Prioritize memory coalescing much higher than conflict-free shared memory access. See NVIDIA's CUDA Best Practices Guide for more information on the priority various optimizations should take.
Not necessarily. Much more often than not, I find that “optimizations” to remove branches simply make the kernel slower.
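To make the static mapping concrete, here is a minimal sketch (not from the posts above; kernel and variable names are mine). It shows how the documented threadIdx-to-warp assignment is usually exploited: consecutive threads form a warp, so having lane k touch element base+k lets the 32 loads coalesce.

```cuda
// Sketch: warp size is 32 on current NVIDIA GPUs.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    // Linear thread index; consecutive threads fall in the same warp:
    // warp 0 = threads 0..31, warp 1 = threads 32..63, ...
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;   // position within the warp
    int warp = threadIdx.x / 32;   // warp index within the block
    (void)lane; (void)warp;        // shown only to illustrate the mapping

    // Lane k of a warp reads element base+k, so the warp's 32 loads
    // coalesce into one (or a few) global memory transactions.
    if (tid < n)
        out[tid] = in[tid];
}
```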
(3) - Since (1) is true, it all depends on whether the branches cause warp divergence and whether that code runs heavily inside a loop. If most of the kernel time is spent in divergent branches, performance will suffer.
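A hypothetical kernel (my own, for illustration) showing the distinction: a branch only costs you when lanes of the same warp take different paths, in which case the hardware serializes both sides with the inactive lanes masked off. A branch whose condition is uniform per warp is essentially free.

```cuda
__global__ void divergence_demo(float *data, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Divergent: even and odd lanes of the SAME warp disagree, so
    // the warp executes both sides, each with half its lanes masked.
    if (tid % 2 == 0)
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;

    // Not divergent: all 32 lanes of a warp share the same warp
    // index, so every lane takes the same side of this branch.
    if ((tid / 32) % 2 == 0)
        data[tid] -= 0.5f;
}
```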
In which order are the 24 warps in an SM selected and executed? I know it's hardware-scheduled to hide memory latency, but is there a way to know the order, just out of curiosity?
The order of warp execution is completely non-deterministic, as warps will sleep while waiting for memory transactions. Warps are highly interleaved, i.e. a typical instruction pattern on the ALUs might look like:
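A hypothetical trace (warp IDs and instruction numbers invented purely for illustration) of that interleaving:

```
warp 3, instruction 14   // issues a global load, then stalls
warp 7, instruction  2   // scheduled while warp 3 waits
warp 7, instruction  3
warp 0, instruction 31
warp 3, instruction 15   // load has returned; warp 3 is eligible again
...
```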