As we know, a GPU is commonly described as a SIMD (more precisely, SIMT) engine that gangs threads together so that one instruction executes on multiple data at the same time. It is also strongly recommended that a warp of 32 threads not be broken up by divergent branches, to avoid a significant performance drop. I'm wondering: one level higher, can we split a block into groups of threads, each a multiple of the warp size, and dispatch a different task to each group?
For example, can we have threads 0-63 read the data in, threads 64-127 run function 1, and threads 128-255 run function 2, all at the same time? How does this compare in performance with not splitting the block and letting all threads run these functions sequentially?
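To make the idea concrete, here is a rough sketch of the kind of kernel I have in mind (the kernel name, the placeholder "functions", and the single 256-thread block are all just for illustration, and I assume n <= 256 so the shared buffer fits):

```cuda
// Hypothetical kernel: one block of 256 threads split into three groups.
// Each group is a whole number of warps, so threads within any single
// warp all take the same branch; divergence occurs only between warps.
__global__ void split_block_kernel(const float *in, float *out1,
                                   float *out2, int n)
{
    __shared__ float buf[256];   // assumes n <= 256
    const int tid = threadIdx.x;

    // Group A: threads 0-63 stage the input into shared memory.
    if (tid < 64) {
        for (int i = tid; i < n; i += 64)
            buf[i] = in[i];
    }
    __syncthreads();  // barrier must be reached by ALL threads in the block

    // Group B: threads 64-127 run "function 1" (placeholder: scale by 2).
    if (tid >= 64 && tid < 128) {
        for (int i = tid - 64; i < n; i += 64)
            out1[i] = 2.0f * buf[i];
    }
    // Group C: threads 128-255 run "function 2" (placeholder: add 1).
    else if (tid >= 128) {
        for (int i = tid - 128; i < n; i += 128)
            out2[i] = buf[i] + 1.0f;
    }
}

// Launched with a single block of 256 threads, e.g.:
// split_block_kernel<<<1, 256>>>(d_in, d_out1, d_out2, 256);
```

One thing I'm unsure about is the barrier: __syncthreads() must be executed by every thread in the block, so it has to sit outside the per-group branches, which seems to force all groups to wait for the slowest one.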