CUDA thread and SM

Hi,

I have 2 CUDA streams and 2 different kernels. To get them to execute in parallel I reduced the number of threads per block, and now I am seeing some parallel behavior.

My question is about the number of threads that we have for each SM. Is it possible to have one SM running 512 threads and another SM running 256 threads? If the maximum number of threads per SM is 512, am I wasting half of the threads on the second SM, which runs only 256?

Also, the relation between the number of CUDA kernels and the number of CUDA threads, and the maximum of each, is not clear to me.

You can have up to 2048 threads per SM on most GPUs up through the Volta architecture. The maximum number of threads per SM is discoverable using a tool like deviceQuery and is also reported in the programming guide, table 14.
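As a rough sketch of what deviceQuery reports, the same limits can be read at run time through the CUDA runtime API. The helper blocksPerSmUpperBound is a hypothetical name introduced here for illustration; it only divides the per-SM thread limit by the block size, so it gives an upper bound and ignores other limits such as registers, shared memory, and the maximum number of resident blocks per SM.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): upper bound on resident blocks per SM
// when only the per-SM thread limit is considered. Real occupancy can be lower
// because of register, shared-memory, and blocks-per-SM limits.
inline int blocksPerSmUpperBound(int maxThreadsPerSm, int threadsPerBlock) {
    return maxThreadsPerSm / threadsPerBlock;
}

int main() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("Device %d: %s\n", dev, prop.name);
        printf("  SM count:              %d\n", prop.multiProcessorCount);
        printf("  Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
        printf("  Max threads per block: %d\n", prop.maxThreadsPerBlock);
        // Block sizes taken from the question: 512 and 256 threads per block.
        printf("  Blocks of 512 per SM (thread limit only): %d\n",
               blocksPerSmUpperBound(prop.maxThreadsPerMultiProcessor, 512));
        printf("  Blocks of 256 per SM (thread limit only): %d\n",
               blocksPerSmUpperBound(prop.maxThreadsPerMultiProcessor, 256));
    }
    return 0;
}
```

On a device that reports 2048 threads per SM, a single 512-thread or 256-thread block leaves most of that SM's thread capacity free for other blocks, possibly from another kernel.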

There is no relationship between the number of CUDA kernels you decide to define and the number of CUDA threads, or the maximum of either.

The full set of CUDA threads that a kernel launch requests is referred to as the grid. You define the grid at launch time. A subset of the grid will begin executing on SMs sometime after you launch your kernel(s).
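To make the launch-time grid concrete, here is a minimal sketch with two kernels in two streams, one using 512 threads per block and the other 256. The kernel names (scaleKernel, offsetKernel), the helper blocksFor, and the problem size are all assumptions made for illustration, not taken from the original post. Whether the two kernels actually overlap depends on the device and on available resources.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernels (hypothetical names): each thread handles one element.
__global__ void scaleKernel(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

__global__ void offsetKernel(float *data, int n, float offset) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += offset;
}

// Number of blocks needed so that blocks * threadsPerBlock covers n elements.
inline int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
    const int n = 1 << 20;  // assumed problem size
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // The grid is fixed here, at launch: <<<blocks, threadsPerBlock, sharedMem, stream>>>.
    const int threadsA = 512, threadsB = 256;
    scaleKernel<<<blocksFor(n, threadsA), threadsA, 0, s1>>>(a, n, 2.0f);
    offsetKernel<<<blocksFor(n, threadsB), threadsB, 0, s2>>>(b, n, 1.0f);

    // Wait for both streams and report any launch or execution error.
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return err == cudaSuccess ? 0 : 1;
}
```

Each launch defines its own grid independently; the hardware scheduler then distributes blocks from both grids across the SMs as resources become available.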