Number of blocks parameter for kernel when GPU has just one SM

This is somewhat of a beginner question. Please bear in mind that I am still quite new to CUDA programming.

I’ve got a Quadro K1000M in my laptop and when I queried the specs with cudaGetDeviceProperties() I got:

Max threads per SM: 2048 -- Num multiprocessors: 1
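For reference, a query along these lines would produce figures like the ones above. This is only a minimal sketch: it assumes device index 0, and the print format is my own rather than anything standard.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Assumption: the GPU of interest is device 0.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }
    std::printf("Device: %s\n", prop.name);
    std::printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    std::printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    std::printf("Num multiprocessors: %d\n", prop.multiProcessorCount);
    return 0;
}
```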

Given that my GPU has just one SM, does it make sense to launch more than one block when running my kernel, rather than a single block, i.e.

fancySchmancyKernel<<<1,N>>>()

Note that the kernel does not use shared memory, since the threads do not need to communicate.

Yes, it might make sense. A single block can have at most 1024 threads. As you have shown, the maximum capacity of that SM is 2048 resident threads; filling it might also be called “maximum occupancy” in this case. Therefore you should launch at least 2 blocks of 1024 threads, or more blocks if you use fewer threads per block.

Whether the difference between 1024 and 2048 resident threads affects performance depends on your actual code, but in many cases it will make a perf difference, because the number of resident threads is a fundamental parameter that determines a GPU’s ability to hide latency.
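As a rough sketch of the launch described above: the kernel body, the array size, and the helper blocksFor() are all hypothetical stand-ins, since the original kernel was not posted. The call to cudaOccupancyMaxActiveBlocksPerMultiprocessor asks the runtime how many blocks of this size can be resident on one SM for this particular kernel; the answer depends on register usage, so it may come out lower than the 2048 / 1024 = 2 you would expect from the thread limits alone.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel in the question:
// each thread handles one element, with no shared memory or communication.
__global__ void fancySchmancyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * data[i];
}

// Hypothetical helper: number of blocks needed to cover n threads (ceiling division).
int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
    const int n = 2048;               // assumed problem size: one thread per element
    const int threadsPerBlock = 1024; // per-block limit mentioned above
    const int blocks = blocksFor(n, threadsPerBlock); // 2 blocks for 2048 threads

    // Ask the runtime how many blocks of this size fit on one SM for this kernel.
    int residentBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &residentBlocksPerSM, fancySchmancyKernel, threadsPerBlock, 0);
    std::printf("Launching %d blocks; %d block(s) of %d threads can be resident per SM\n",
                blocks, residentBlocksPerSM, threadsPerBlock);

    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(d_data, 0, n * sizeof(float));

    fancySchmancyKernel<<<blocks, threadsPerBlock>>>(d_data, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

If the reported number of resident blocks is 1 rather than 2, trying a smaller block size (say 512 or 256) is a reasonable next experiment.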

Ah, you’re right – thanks txbob. The max number of threads per block is 1024. (I’ve just queried that from cudaDeviceProp).

When compiling for a much beefier GPU, should my block creation strategy be to create at least as many blocks as there are SMs, in the hope that the blocks will be distributed among all SMs? And, in general, do smaller blocks amount to better sharing of the workload between SMs? For example, suppose I had 500 threads and 4 SMs and created blocks of size 100. That gives 5 blocks, so there is a chance that one of the SMs would have two blocks while the rest have one each. Or is it futile to think about these things, considering that none of the details about how blocks are assigned to SMs have been made public?

https://users.ices.utexas.edu/~sreepai/fermi-tbs/ This link might shed some light on how thread blocks are distributed.
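On the question of sizing the grid for a bigger GPU, one common pattern (not something stated in this thread, so treat it as an assumption) is to make the block count a small multiple of the SM count and let each thread loop over several elements. That way the kernel stays correct however the blocks end up being assigned to SMs. The kernel scaleKernel, the helper gridSizeFor(), and the factor of 4 blocks per SM are all hypothetical choices for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel using a grid-stride loop: works for any grid size,
// because each thread steps through the array by the total thread count.
__global__ void scaleKernel(float *data, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        data[i] *= 2.0f;
    }
}

// Hypothetical helper: blocksPerSM blocks for each SM, but never more
// blocks than are needed to give every element its own thread.
int gridSizeFor(int n, int threadsPerBlock, int smCount, int blocksPerSM) {
    int needed = (n + threadsPerBlock - 1) / threadsPerBlock;
    int wanted = smCount * blocksPerSM;
    return wanted < needed ? wanted : needed;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1; // device 0 assumed

    const int n = 1 << 20;           // assumed problem size
    const int threadsPerBlock = 256; // arbitrary choice, worth tuning
    const int blocksPerSM = 4;       // arbitrary choice, worth tuning
    const int blocks = gridSizeFor(n, threadsPerBlock,
                                   prop.multiProcessorCount, blocksPerSM);

    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(d_data, 0, n * sizeof(float));

    scaleKernel<<<blocks, threadsPerBlock>>>(d_data, n);
    cudaError_t err = cudaDeviceSynchronize();
    std::printf("%d SMs, launched %d blocks of %d threads: %s\n",
                prop.multiProcessorCount, blocks, threadsPerBlock,
                cudaGetErrorString(err));

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

With the numbers from the question (500 threads, blocks of 100, 4 SMs), gridSizeFor(500, 100, 4, 1) gives 4 blocks, so each thread in the first block would simply handle a second element instead of a fifth block being launched.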