I have been working on CUDA for some time and have started to learn some optimization techniques. Here are a few doubts:
(1) Is a warp strictly selected as a set of 32 consecutive threads, like (t0,t1,…,t31), (t32,t33,…,t63), …?
(2) Are the threads for a warp selected in such a way that you get the most coalesced global memory reads? Or so that you get the fewest shared memory bank conflicts? Which gets higher priority?
(3) If (1) is true and you have nested branches in the kernel, isn’t that a huge performance hit?
I have also heard about warp-based programming. How exactly does it differ from thread-based programming, and what are its advantages? Can someone pass along a good link about it?
Yes. The mapping of threadIdx.x to warps is explicitly documented in the CUDA C programming guide.
Neither. Threads are mapped to warps via a static assignment. It is up to you, the kernel programmer, to take advantage of that assignment to coalesce your reads and avoid bank conflicts. Prioritize memory coalescing much higher than conflict-free shared memory access. See NVIDIA's CUDA Best Practices Guide for more information on the priority various optimizations should take.
Not necessarily. Much more often than not, I find that “optimizations” to remove branches simply make the kernel slower.
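To make the static mapping concrete, here is a minimal sketch (not from the posts above; kernel and variable names are mine). It shows how the documented threadIdx-to-warp assignment is usually exploited: consecutive threads form a warp, so having lane k touch element base+k lets the 32 loads coalesce.

```cuda
// Sketch: warp size is 32 on current NVIDIA GPUs.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    // Linear thread index; consecutive threads fall in the same warp:
    // warp 0 = threads 0..31, warp 1 = threads 32..63, ...
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;   // position within the warp
    int warp = threadIdx.x / 32;   // warp index within the block
    (void)lane; (void)warp;        // shown only to illustrate the mapping

    // Lane k of a warp reads element base+k, so the warp's 32 loads
    // coalesce into one (or a few) global memory transactions.
    if (tid < n)
        out[tid] = in[tid];
}
```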
(3) - Since (1) is true, it all depends on whether the branches cause warp divergence and whether that code runs heavily inside a loop. If most of the kernel time is spent in divergent branches, performance will suffer.
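A hypothetical kernel (my own, for illustration) showing the distinction: a branch only costs you when lanes of the same warp take different paths, in which case the hardware serializes both sides with the inactive lanes masked off. A branch whose condition is uniform per warp is essentially free.

```cuda
__global__ void divergence_demo(float *data, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Divergent: even and odd lanes of the SAME warp disagree, so
    // the warp executes both sides, each with half its lanes masked.
    if (tid % 2 == 0)
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;

    // Not divergent: all 32 lanes of a warp share the same warp
    // index, so every lane takes the same side of this branch.
    if ((tid / 32) % 2 == 0)
        data[tid] -= 0.5f;
}
```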
In which order are the 24 warps in an SM selected and executed? I know it's hardware-scheduled to hide memory latency, but is there a way to know the order, just out of curiosity?
The order of warp execution is completely non-deterministic, as warps will sleep while waiting for memory transactions. Warps are highly interleaved, i.e. a typical instruction pattern on the ALUs might look like:
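A hypothetical trace (warp IDs and instruction numbers invented purely for illustration) of that interleaving:

```
warp 3, instruction 14   // issues a global load, then stalls
warp 7, instruction  2   // scheduled while warp 3 waits
warp 7, instruction  3
warp 0, instruction 31
warp 3, instruction 15   // load has returned; warp 3 is eligible again
...
```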