warpsize's effect on performance?

The default warpsize is 32. My understanding is that this means the threads in each block will run in groups of 32 threads each. If I launch a kernel with 10 blocks and 100 threads/block, the 100 threads in each block will be divided into 4 groups, with the last group carrying only 4 threads: 32x3+4=100. Threads from different blocks won’t be grouped together. Am I right?

If I change the warpsize to a different number, say 20, the 100 threads would be divided evenly into 5 groups. Is there a performance gain or loss if I do it this way? Thanks.
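
For concreteness, here is a minimal sketch (CUDA C, kernel name show_warp_layout is just made up) of the launch configuration I mean, computing which warp and lane each thread falls into:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel whose only purpose is to report how threads are grouped.
__global__ void show_warp_layout()
{
    int warp_in_block = threadIdx.x / warpSize;  // 0..3 for 100 threads per block
    int lane          = threadIdx.x % warpSize;  // position within the warp
    if (lane == 0)
        printf("block %d, warp %d starts at thread %d\n",
               blockIdx.x, warp_in_block, threadIdx.x);
}

int main()
{
    show_warp_layout<<<10, 100>>>();   // 10 blocks, 100 threads per block
    cudaDeviceSynchronize();
    return 0;
}
```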

Am I right?

Your first statement is correct. Actually, the 4th group will also contain 32 threads, but only 4 of them will be marked as active; the other 28 will be marked as inactive. Thus, it is always advisable to use a multiple of 32 for the number of threads per block, because the threads will be instantiated anyway.
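
As a sketch of that advice (the kernel and array names are made up for illustration): pick a thread count per block that is a multiple of 32 and guard against out-of-range indices, so the extra threads simply do nothing:

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel: doubles the first n elements of a.
__global__ void scale(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // threads beyond n do no work
        a[i] = 2.0f * a[i];
}

int main()
{
    const int n = 1000;
    float *d_a;
    cudaMalloc(&d_a, n * sizeof(float));

    // 128 threads per block (a multiple of the warp size of 32);
    // round the block count up so all n elements are covered.
    int threads = 128;
    int blocks  = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_a, n);

    cudaDeviceSynchronize();
    cudaFree(d_a);
    return 0;
}
```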

If I change the warpsize to a different number

You CANNOT change the warpsize. The warpsize is defined by your GPU architecture (just like, for example, the maximum number of threads per block). At the moment NVIDIA defines a warpsize of 32, but since this is not a standard specification, it might change in the future.

Also see the CUDA Fortran manual:

The variable warpsize contains the number of threads in a warp. It has constant value, currently defined to be 32.
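
If you need the value in host code, you can query it rather than hard-coding 32. A minimal sketch using the CUDA runtime API:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);           // properties of device 0
    printf("warp size: %d\n", prop.warpSize);    // currently 32 on NVIDIA GPUs

    // In device code, the built-in variable warpSize (CUDA C) or the
    // warpsize constant (CUDA Fortran) provides the same value.
    return 0;
}
```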

I hope that helps, Sandra