I want to understand something about CUDA programs: if I launch a kernel with `<<<1, N>>>`, where N is width*width, this creates 1 block of N threads.
- Does this guarantee that this one block is mapped to and executed on only one SM, or can it run across multiple SMs on the same GPU?
- Is the execution of these threads serialised or concurrent within the CUDA kernel?
I am working on the Tesla K80.
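For reference, here is a minimal sketch of the launch configuration described above; the kernel name `myKernel` and the value of `width` are placeholders, not taken from any actual code in this thread:

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(float *data) {
    // With <<<1, N>>>, threadIdx.x runs from 0 to N-1 within the single block.
    int tid = threadIdx.x;
    data[tid] *= 2.0f;
}

int main() {
    const int width = 16;
    const int N = width * width;   // 256 threads in one block

    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));

    // One block of N threads. N must not exceed the per-block limit
    // (1024 threads on the Tesla K80, compute capability 3.7).
    myKernel<<<1, N>>>(d_data);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```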
Thanks a lot for the reply.
Going by this logic, if I have 10 SMs available and I create 4 thread blocks,
- Will they be allotted 4 out of the 10 SMs? Is it fixed at 4 or not?
- Suppose I want to have 32 thread blocks running on 10 SMs; how will they be scheduled?
- Will they be allotted 4 out of the 10 SMs? Is it fixed at 4 or not?
You will likely see these 4 thread blocks running on 4 SMs. But don’t make any assumptions about which SMs the block scheduler picks.
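If you want to check this empirically, one common trick is to read the `%smid` special register from inside the kernel, which reports the SM a thread is running on. A sketch (the kernel and launch parameters here are illustrative, not from your code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Reads the %smid special register, which holds the ID of the SM
// the calling thread is currently executing on.
__device__ unsigned int get_smid() {
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

__global__ void reportSM(unsigned int *smids) {
    // One thread per block records the block's SM ID.
    if (threadIdx.x == 0)
        smids[blockIdx.x] = get_smid();
}

int main() {
    const int numBlocks = 4;
    unsigned int *d_smids, h_smids[numBlocks];
    cudaMalloc(&d_smids, numBlocks * sizeof(unsigned int));

    reportSM<<<numBlocks, 128>>>(d_smids);
    cudaMemcpy(h_smids, d_smids, numBlocks * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);

    for (int b = 0; b < numBlocks; ++b)
        printf("block %d ran on SM %u\n", b, h_smids[b]);

    cudaFree(d_smids);
    return 0;
}
```

Note that the assignment can differ from run to run, which is exactly why you shouldn't rely on it.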
- Suppose I want to have 32 thread blocks running on 10 SMs; how will they be scheduled?
a) This is not documented and may vary depending on GPU architecture (generation).
b) It also depends a lot on the achievable occupancy of your kernel. One SM may be capable of executing 2, 3, 4 or more blocks simultaneously, provided the registers per thread, the shared memory requested, and the number of texture units used by each block allow for it. In that case you may see all 32 thread blocks execute concurrently.
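You can ask the runtime how many blocks of a given kernel fit on one SM using the occupancy API. A sketch, with `myKernel` and `blockSize` standing in for your actual kernel and launch configuration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data) {
    data[threadIdx.x + blockIdx.x * blockDim.x] *= 2.0f;
}

int main() {
    int blocksPerSM = 0;
    const int blockSize = 256;  // threads per block; adjust to your launch

    // Asks the runtime how many blocks of this kernel can be resident on
    // one SM at once, given its register and shared-memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, myKernel, blockSize, 0 /* dynamic shared memory */);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("blocks per SM: %d, SM count: %d -> up to %d concurrent blocks\n",
           blocksPerSM, prop.multiProcessorCount,
           blocksPerSM * prop.multiProcessorCount);
    return 0;
}
```

If the reported blocks-per-SM times the SM count is 32 or more, all 32 blocks can be resident at once; otherwise the block scheduler drains them in waves as resources free up.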