One possible reason is an attempt to maximize occupancy. Roughly defined, occupancy is the number of threads that are resident and executing on an SM. Occupancy may have a number of limiting factors, and the CUDA toolkit ships with an occupancy calculator spreadsheet to help you calculate the possible occupancy for a particular code.
One limiter to occupancy can be register usage. Each GPU SM has a limited number of registers, in many cases 65536. Let's also keep in mind that the maximum number of threads that an SM can sustain for execution is 2048. Achieving 2048 could be called 100% occupancy, and this can be considered an upper bound. Higher occupancy may lead to higher performance for some codes.
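One consequence of those two numbers: for full occupancy, the register file has to be shared across all 2048 threads, which bounds how many registers each thread can use. A quick back-of-the-envelope check (using the example figures above; actual limits also depend on register allocation granularity on real hardware):

```python
# Example figures from the discussion above, not queried from real hardware.
regs_per_sm = 65536          # registers available per SM
max_threads_per_sm = 2048    # maximum resident threads per SM

# To reach 100% occupancy, each thread can use at most this many registers:
max_regs_for_full_occupancy = regs_per_sm // max_threads_per_sm
print(max_regs_for_full_occupancy)  # 32
```

So a kernel using more than 32 registers per thread (like the 36-register example that follows) cannot reach 100% occupancy on such an SM, regardless of how the threadblocks are sized.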
Suppose I have a code that uses 36 registers per thread. If I have a threadblock of 1024 threads, then 36*1024 = 36864 registers are needed to support the execution of that threadblock. The SM has 65536 registers, so that works. But if I wanted to launch another threadblock on the same SM, I would need another 36864 registers. For that, I don't have enough. So the maximum occupancy in this scenario would be 1024 threads, out of the maximum of 2048, or 50%.
Now suppose the threadblock size is 512, with no other changes. Each threadblock needs 512*36 = 18432 registers. In this situation, I could support 3 threadblocks per SM before running out of registers. This gives me an occupancy of 1536 threads, or 75%. It's not guaranteed to be true, but in many cases this higher occupancy can lead to higher overall performance.
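The arithmetic in the two scenarios above can be captured in a small helper. This is a hypothetical sketch of the register-limited occupancy calculation only (a real occupancy calculation also accounts for shared memory, blocks-per-SM limits, and allocation granularity), using the example figures from this discussion:

```python
def register_limited_occupancy(regs_per_thread, block_size,
                               regs_per_sm=65536, max_threads_per_sm=2048):
    """Resident threads and occupancy fraction when registers (or the
    thread limit) are the constraint. Defaults are the example SM
    figures used above, not queried from real hardware."""
    regs_per_block = regs_per_thread * block_size
    blocks_by_regs = regs_per_sm // regs_per_block        # register limit
    blocks_by_threads = max_threads_per_sm // block_size  # thread limit
    blocks = min(blocks_by_regs, blocks_by_threads)
    threads = blocks * block_size
    return threads, threads / max_threads_per_sm

print(register_limited_occupancy(36, 1024))  # (1024, 0.5)  -> 50% occupancy
print(register_limited_occupancy(36, 512))   # (1536, 0.75) -> 75% occupancy
```

This reproduces the two cases above: one 1024-thread block fits (50%), while three 512-thread blocks fit (75%).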
Shared memory usage (if any) by the kernel code can be another limiting factor on occupancy, and in some cases it can have a similar effect to the one described above for register usage.
Therefore, in some cases, a smaller threadblock size can lead to higher overall performance.