The default warp size is 32. My understanding is that the threads in each block run in groups of 32 threads each. If I launch a kernel with 10 blocks of 100 threads per block, the 100 threads in each block will be divided into 4 groups, and the last group will carry only 4 threads: 32 × 3 + 4 = 100. Threads in different blocks won't be grouped together. Am I right?
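To make the arithmetic concrete, here is a minimal sketch of how I picture the grouping inside a single block. It is plain host-side Python, not CUDA, and the function name `split_into_warps` and its parameters are my own illustration, not part of any CUDA API. It only models my assumption that threads are packed into groups in order, with any leftover threads forming a final, partially filled group.

```python
def split_into_warps(threads_per_block, warp_size=32):
    """Return the number of active threads in each group (warp) of one block.

    Hypothetical helper: it models the assumption that threads are packed
    into warps in order, and a leftover forms a partially filled last warp.
    """
    full_warps, remainder = divmod(threads_per_block, warp_size)
    sizes = [warp_size] * full_warps
    if remainder:
        sizes.append(remainder)  # last warp is only partially filled
    return sizes


# One block of 100 threads with a warp size of 32
print(split_into_warps(100))  # [32, 32, 32, 4]
```

Because the grouping is done per block in this model, launching 10 such blocks would simply repeat the same split 10 times, with no group spanning two blocks.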
If I change the warp size to a different number, say 20, the 100 threads would be divided evenly into 5 groups. Is there a performance gain or loss if I do it this way? Thanks.