How do I specify the number of SMs used by my program?

If I want to launch a kernel with N threads, how can I make the GPU scheduler allocate as many SMs as possible to my program?

Other than via CUDA stream priorities, you have no control over the block scheduler in a GPU.
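To illustrate the one knob mentioned above, here is a minimal sketch of launching work in streams with different priorities. The kernel `busyKernel`, the helper `runPrioritizedLaunch`, and the sizes are hypothetical placeholders, and a higher priority is only a hint to the scheduler about which pending blocks to prefer; it does not pin work to particular SMs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel standing in for real work.
__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Launches the same kernel in a low-priority and a high-priority stream.
cudaError_t runPrioritizedLaunch()
{
    // Query the supported range. Numerically lower values mean higher priority.
    int leastPriority = 0, greatestPriority = 0;
    cudaError_t err = cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);
    if (err != cudaSuccess) return err;

    cudaStream_t lowStream, highStream;
    err = cudaStreamCreateWithPriority(&lowStream, cudaStreamNonBlocking, leastPriority);
    if (err != cudaSuccess) return err;
    err = cudaStreamCreateWithPriority(&highStream, cudaStreamNonBlocking, greatestPriority);
    if (err != cudaSuccess) { cudaStreamDestroy(lowStream); return err; }

    const int n = 1 << 20;          // assumed problem size
    const int threadsPerBlock = 256; // assumed block size
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    // When both kernels have blocks waiting, the scheduler may prefer
    // blocks from the higher-priority stream.
    busyKernel<<<blocks, threadsPerBlock, 0, lowStream>>>(a, n);
    busyKernel<<<blocks, threadsPerBlock, 0, highStream>>>(b, n);

    err = cudaDeviceSynchronize();

    cudaFree(a);
    cudaFree(b);
    cudaStreamDestroy(lowStream);
    cudaStreamDestroy(highStream);
    return err;
}

int main()
{
    cudaError_t err = runPrioritizedLaunch();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Both kernels completed.\n");
    return 0;
}
```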

The heuristics of block scheduling are not published.

The GPU block scheduler will generally attempt to deliver blocks to SMs in such a way as to maximize throughput of your kernel. This generally means delivering blocks evenly to all available SMs.

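You should strive for full occupancy of the GPU. As a minimum target, this means launching kernels whose total thread count is at least (maximum resident threads per SM) * (number of SMs in your GPU). The per-SM limit is 2048 on many architectures, but 1024 or 1536 on some others, so check your device's properties rather than hard-coding it.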
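As a rough sketch of that arithmetic, the following program queries the device and prints the minimum thread count and the corresponding grid size. The helper names `minThreadsForFullOccupancy` and `blocksNeeded` and the block size of 256 are assumptions made for illustration. Reaching this thread count is necessary but not sufficient for full occupancy, since register and shared memory usage per block can lower the number of threads an SM can actually hold.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Threads needed to fill every SM up to its resident-thread limit.
long long minThreadsForFullOccupancy(int smCount, int maxThreadsPerSM)
{
    return static_cast<long long>(smCount) * maxThreadsPerSM;
}

// Number of blocks needed to cover totalThreads, rounded up.
long long blocksNeeded(long long totalThreads, int threadsPerBlock)
{
    return (totalThreads + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    const int device = 0;            // assumed: first GPU in the system
    const int threadsPerBlock = 256; // assumed block size

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    long long minThreads = minThreadsForFullOccupancy(
        prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor);

    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Max resident threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Minimum total threads: %lld\n", minThreads);
    printf("Blocks of %d threads needed: %lld\n",
           threadsPerBlock, blocksNeeded(minThreads, threadsPerBlock));
    return 0;
}
```

For example, a GPU with 80 SMs and a limit of 2048 threads per SM would need at least 163840 threads, which is 640 blocks of 256 threads.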