I have been working on CUDA for some time and have started to learn some optimization techniques. Here are a few doubts:
(1) Is a warp strictly selected as a set of 32 consecutive threads, like (t0,t1,…,t31), (t32,t33,…,t63), …?
(2) Are the threads for a warp selected in such a way that global memory reads are maximally coalesced? Or so that shared memory bank conflicts are minimized? Which one gets higher priority?
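To show what I mean by coalescing in (2), here are two toy kernels (a minimal sketch; kernel names and the stride parameter are made up by me). In the first, neighboring threads of a warp touch consecutive addresses; in the second, they are spread out by a stride:

```cuda
// Made-up example kernels to illustrate the access patterns I'm asking about.

// Neighboring threads read neighboring addresses: t0..t31 of a warp
// touch one contiguous region, which I understand coalesces well.
__global__ void coalesced_copy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Neighboring threads read addresses that are `stride` elements apart,
// so one warp's reads scatter across many memory segments.
__global__ void strided_copy(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}
```

My question is whether the hardware's warp selection does anything to help the strided case, or whether that is entirely on the programmer.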
(3) If (1) is true and you have nested branches in the kernel, isn't that a huge performance hit?
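For (3), this is the kind of nested branching I have in mind (a made-up sketch; the kernel name and flag scheme are mine, not from any real code):

```cuda
// Made-up kernel showing nested data-dependent branches within a warp.
__global__ void nested_branch(const int *flags, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (flags[i] & 1) {          // some threads of the warp take this path...
        if (flags[i] & 2)        // ...and may split again here
            out[i] = 1.0f;
        else
            out[i] = 2.0f;
    } else {
        out[i] = 3.0f;           // while others take this path
    }
}
```

If `flags[i]` varies within a single warp, do all three leaf paths get serialized one after another, multiplying the cost?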
I have also heard about warp-based programming. How exactly does it differ from thread-based programming, and what are its advantages? Can someone point me to a good link on it?