Yes, it might make sense. A single block has a maximum of 1024 threads. As you have shown, the max capacity for that SM is 2048 threads; this might also be called "maximum occupancy" in this case. Therefore you should launch at least 2 blocks of 1024 threads, or more blocks if your threads-per-block count is smaller.
Whether or not the difference between 1024 resident threads and 2048 resident threads makes a performance difference will depend on your actual code, but in many cases it will, since the number of resident threads is a fundamental parameter that determines a GPU's ability to hide latency.
Ah, you’re right – thanks txbob. The max number of threads per block is 1024. (I’ve just queried that from cudaDeviceProp).
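For reference, the relevant limits can all be read back at runtime from `cudaDeviceProp` — a minimal sketch (requires the CUDA toolkit; queries device 0):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    printf("maxThreadsPerBlock:          %d\n", prop.maxThreadsPerBlock);
    printf("maxThreadsPerMultiProcessor: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("multiProcessorCount:         %d\n", prop.multiProcessorCount);
    return 0;
}
```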
When compiling for a much beefier GPU, should my block-creation strategy be to create at least as many blocks as there are SMs, in the hope that the blocks will be distributed among all SMs? And, in general, do smaller blocks amount to better sharing of the workload between SMs? For example, if I had 500 threads and 4 SMs but created blocks of size 100, there is a chance that one of the SMs would get two blocks while the rest get one each. Or is it futile to think about these things, considering that none of the details of how blocks are assigned to SMs have been made public?
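One portable way to sidestep guessing at the (undocumented) block scheduler is to size the grid from the device's own SM count and per-SM thread capacity, so every SM has enough resident blocks regardless of how they are distributed — a sketch (the kernel and its argument are placeholders, and the multiplier is a tuning knob, not a published rule):

```cuda
#include <cuda_runtime.h>

__global__ void kernel(int n) { /* placeholder kernel body */ }

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int threadsPerBlock = 256;
    // Enough blocks to saturate every SM's thread capacity,
    // e.g. 2048 / 256 = 8 resident blocks per SM on the device discussed above.
    int blocks = prop.multiProcessorCount *
                 (prop.maxThreadsPerMultiProcessor / threadsPerBlock);

    kernel<<<blocks, threadsPerBlock>>>(500);
    cudaDeviceSynchronize();
    return 0;
}
```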