What is the GPU scheduler’s strategy for allocating SMs to my program? Is there an API I can use to specify the number of SMs my program should use?
Other than via CUDA stream priorities, you have no control over the block scheduler in a GPU.
The heuristics of block scheduling are not published.
The GPU block scheduler will generally attempt to deliver blocks to SMs in such a way as to maximize throughput of your kernel. This generally means delivering blocks evenly to all available SMs.
You should strive for full occupancy of the GPU. As a minimum target, this means launching kernels with a total thread count of at least 2048 × (the number of SMs in your GPU), or more.
Thanks for your quick reply.
By the way, could you please explain why it is 2048 × the number of SMs? Does this mean the number of threads per block is 2048?
Thanks again for your help.
OpenCL allows you to divide a GPU into sub-regions. But if your goal is to fill as much of the GPU as possible, that’s hardly of interest to you.
One SM can execute up to 2048 threads concurrently (a per-SM residency limit, not a threads-per-block count). Additionally, taking the tail effect into account, the optimal amount of work is 20K threads or more per SM.