How to run kernels in percentages of the GPU resources?

Hi!
Is it possible to run a CUDA kernel on the GPU without using all of the available resources?

For example, I would like to run kernels using 30%, 50%, or 70% of the available resources.
How could I achieve it?

It really depends on the characteristics of your kernel. If your kernel is compute-bound, you can control the utilization of the hardware by launching a small number of blocks/threads, so that only some of the SMs have work. You need to run nvprof/nvvp to find out what the active block size is (determined by your register/shared memory use) and set the thread count/block size accordingly. You can then verify with nvvp whether you actually get partial occupancy.
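As a rough illustration of the "launch fewer blocks" idea, here is a minimal host-side sketch. The helper name `blocks_for_fraction` is hypothetical, and `sm_count` is assumed to come from something like `cudaDeviceProp::multiProcessorCount`. It relies only on the fact that a block runs on a single SM, so launching k blocks means at most k SMs can be busy at once. Which SMs the hardware actually picks is not under your control, so treat the result as an upper bound to check in the profiler.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper (not from this thread): number of blocks to launch so
// that at most about `fraction` of the SMs can be busy at the same time.
// sm_count is assumed to be the SM count reported by the device.
int blocks_for_fraction(int sm_count, double fraction) {
    assert(sm_count > 0);
    assert(fraction > 0.0 && fraction <= 1.0);
    // Round to the nearest whole SM.
    int target = static_cast<int>(std::lround(sm_count * fraction));
    // Always launch at least one block and never more than one per SM.
    return std::max(1, std::min(target, sm_count));
}
```

For a 4-SM device, `blocks_for_fraction(4, 0.5)` gives 2 blocks. Each block then has to loop over its share of the data itself (for example with a grid-stride loop), since the grid no longer covers the whole problem.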

However, this is unlikely to work if your memory overhead is significant. If there is any global memory read/write, the typical strategy is to launch a large number of threads to “saturate” the hardware resources: when some of the threads wait for data, other threads in the waiting list can be switched in to keep the cores busy. In this case, you may not be able to simulate partial occupancy by launching only a small number of threads, because there is no backlog of work for the scheduler to run while some threads are waiting for data.

In one of our papers published last year, we tried to launch an “appropriate” number of threads/blocks so that the kernel can “fully occupy” the hardware, in order to achieve “persistent threads”. See Fig. 2 (Opt2).

https://doi.org/10.1117/1.JBO.23.1.010504

So we determine our thread/block sizes based on the SM/core counts of the different processors. Here is the code:

https://github.com/fangq/mcxcl/blob/master/src/mcx_host.cpp#L357-L377
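To make the sizing idea concrete without reproducing the linked code, here is a minimal sketch of one way to derive a "saturating" launch size from device limits. It is not the mcxcl implementation. The names `LaunchConfig` and `saturating_config` are hypothetical, and `max_threads_per_sm` is assumed to be a limit such as `cudaDeviceProp::maxThreadsPerMultiProcessor`. The sketch ignores register and shared memory limits, which can lower the number of blocks that actually fit on an SM.

```cpp
#include <cassert>

// Hypothetical launch configuration; names are illustrative only.
struct LaunchConfig {
    int blocks;
    int threads_per_block;
};

// Sizes the grid so that every SM holds as many whole blocks as its
// resident-thread limit allows (register/shared memory limits are ignored).
LaunchConfig saturating_config(int sm_count, int max_threads_per_sm,
                               int threads_per_block) {
    assert(sm_count > 0 && max_threads_per_sm > 0);
    assert(threads_per_block > 0 && threads_per_block <= max_threads_per_sm);
    // Whole blocks that fit within one SM's thread limit.
    int blocks_per_sm = max_threads_per_sm / threads_per_block;
    return LaunchConfig{sm_count * blocks_per_sm, threads_per_block};
}
```

With 4 SMs, a limit of 2048 resident threads per SM, and 256 threads per block, this gives 8 blocks per SM and 32 blocks in total.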

Thank you very much for your reply.

But I am looking for a way to set limits on the GPU resources before the kernel is executed.
For example, given a GPU with 4 SMs and 128 cores per SM, I want to set a limit of 2 active SMs and 64 active cores per SM. I want to launch the same kernel on the same GPU with different resources. Is there any way to do this (in OpenCL or CUDA)?

Probably MPS is a way to achieve this in CUDA, but what about OpenCL?