On a Fermi GPU, if each thread block has 16x16 threads, can anyone tell me how the 32 threads of a warp will be distributed? Will each warp cover two adjacent rows of 16 threads? That would seem logical, but I haven’t been able to find a definitive answer anywhere.
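To make the mapping I have in mind concrete, here is a minimal host-side sketch in plain C++ (no GPU needed). It assumes that threads in a block are numbered linearly with x varying fastest, and that each warp takes 32 consecutive linear IDs, which is how I read the CUDA programming guide. The names linearThreadId, warpId and laneId are my own helpers for illustration, not CUDA API calls.

```cpp
// Assumed warp size (32 on Fermi).
constexpr int kWarpSize = 32;

// Linear index of a thread inside its block, assuming x varies fastest,
// then y, then z.
constexpr int linearThreadId(int tx, int ty, int tz, int dimX, int dimY) {
    return tx + ty * dimX + tz * dimX * dimY;
}

// Warp that a thread would belong to under the assumption above.
constexpr int warpId(int tx, int ty, int tz, int dimX, int dimY) {
    return linearThreadId(tx, ty, tz, dimX, dimY) / kWarpSize;
}

// Position of the thread within its warp (0..31).
constexpr int laneId(int tx, int ty, int tz, int dimX, int dimY) {
    return linearThreadId(tx, ty, tz, dimX, dimY) % kWarpSize;
}

// 16x16 block: rows 0 and 1 would share warp 0, and row 2 would start warp 1.
static_assert(warpId(0, 0, 0, 16, 16) == warpId(15, 1, 0, 16, 16),
              "rows 0 and 1 fall in the same warp");
static_assert(warpId(0, 2, 0, 16, 16) == 1, "row 2 starts the next warp");
```

If that assumption holds, a 16x16 block would split into 8 warps of two adjacent rows each, whereas a block 32 threads wide would give exactly one row per warp.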
A related question: I realize it depends somewhat on the application, but as a general rule on a Fermi GPU, do blocks that are 32 threads wide give the best performance?
Thanks in advance.