How to limit number of CUDA Cores

Hi all,

I am using NVIDIA GPUs and CUDA for research purposes at our university.
The problem is that I need to limit the number of CUDA cores used by my application.

Is it possible?

Any comments are appreciated!

Kind regards,

Launch a low number of thread blocks, and keep these blocks persistent until your entire computation finishes.

You will be running on only a subset of the available multiprocessors on the device (one block per multiprocessor).

Keep the number of threads per block quite high so that each of the multiprocessors has enough pending warps to hide memory access latencies.
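
Something like this (untested sketch; the kernel body and names are just placeholder work for illustration):

```cpp
// Grid-stride loop: the few blocks you launch keep iterating until all
// n elements are processed, so a small grid still covers the whole problem.
__global__ void persistentKernel(const float *in, float *out, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        out[i] = 2.0f * in[i];              // placeholder work
}

// Host side: launching only 2 blocks of 256 threads means at most
// 2 multiprocessors ever receive work, regardless of the problem size.
persistentKernel<<<2, 256>>>(d_in, d_out, n);
```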

Thanks for answering, cbuchner1.

Following this example: my GPU has four multiprocessors.

But if I launch four blocks, is one guaranteed to be scheduled on each multiprocessor?

Isn't it possible that they get scheduled on only the first two multiprocessors (two blocks each)?

Yours sincerely.

While the details of CUDA's work scheduling algorithm remain undocumented (and differ slightly between device generations), I believe you should see just one block assigned to each multiprocessor.

There is a way to make absolutely sure that there’s just one block per multiprocessor.

Reserve an amount of shared memory per block (or of any other limited resource, such as registers) so that, given the hardware's capabilities, no more than one block can execute on a multiprocessor.
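
For example (untested sketch; assumes a device with 48 KB of shared memory per multiprocessor, so adjust the reservation for your hardware):

```cpp
__global__ void oneBlockPerSM(float *data, int n)
{
    extern __shared__ char pad[];           // reserved at launch time, never actually used
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        data[i] *= 2.0f;                    // placeholder work
}

// Requesting more than half of the per-SM shared memory (here 32 KB out of 48 KB)
// means the hardware cannot fit a second block on the same multiprocessor.
oneBlockPerSM<<<4, 256, 32 * 1024>>>(d_data, n);
```

With that reservation in place, a four-block launch has to land on four different multiprocessors, since no multiprocessor can hold more than one of these blocks.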

Again, thanks for your reply.

Great idea to limit resources.

Yours sincerely.

If you think carefully about the use of smid, you can do even stranger stuff like load up N blocks on a single SM while all the other SMs are empty. In a nutshell:

  • Launch blocks of a single thread each. The number of blocks to launch is the max blocks per multiprocessor times the number of multiprocessors. (Or, alternatively, you can experiment with N blocks times the number of multiprocessors if you want to load up N blocks, but there may be some variability in that case.)

  • Each block (thread) reads its smid. If the smid is the desired one (let’s say 0), then the thread code spins forever (or, say, 30 minutes, or however long you need) using clock64(). If the smid is not the desired one, then the thread code spins for 1 second and exits.

After about a second, you should have one SM with a full complement of blocks (so no new blocks can launch to it), and the other SMs empty. Be sure to launch the above kernel in a particular non-default stream. The app that launched that kernel can then launch other kernels which should run concurrently on the remaining SMs, assuming you launch them to separate streams.
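
Something along these lines (untested sketch; the names and timing constants are made up for illustration, and get_smid() uses the same %smid special register described in the link at the end of this post):

```cpp
__device__ unsigned int get_smid()
{
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));   // id of the multiprocessor this block runs on
    return smid;
}

// Blocks on the target SM spin for a long time; blocks elsewhere spin briefly and exit.
__global__ void occupySM(unsigned int targetSM, long long holdCycles, long long exitCycles)
{
    long long start = clock64();
    long long limit = (get_smid() == targetSM) ? holdCycles : exitCycles;
    while (clock64() - start < limit)
        ;                                      // busy-wait
}

// Host side (error checking omitted):
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

int maxBlocksPerSM = 16;                       // resident-block limit for your architecture
int nBlocks = maxBlocksPerSM * prop.multiProcessorCount;

cudaStream_t occupyStream;
cudaStreamCreate(&occupyStream);

long long cyclesPerSecond = (long long)prop.clockRate * 1000;       // clockRate is in kHz
occupySM<<<nBlocks, 1, 0, occupyStream>>>(0,                        // pin SM 0
                                          30LL * 60 * cyclesPerSecond,  // hold ~30 minutes
                                          cyclesPerSecond);             // others exit after ~1 s
// Work kernels launched afterwards into other streams should land on the remaining SMs.
```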

When you’re done with everything, you can just exit the process. That will kill the “persistent” kernel. Or you can explicitly call cudaDeviceReset().

It’s straightforward to extend the above technique to two or more SMs.

Don’t try this from separate processes. Activity from separate processes requires a context switch. I don’t think the context switch can occur while a “persistent” kernel is running.

How to read smid? Try this:

[url]https://devtalk.nvidia.com/default/topic/518634/execution-id/?offset=7[/url]


Hi TxBob.

Are you implying that in a multithreaded application I can launch a permanently resident “worker block” on a single SMX of my GPU, while other threads can continue to use the remainder of the GPU - all that without expensive context switches?

If that is so, I have a use for this feature already. I have some smallish matrix multiplications to compute, say a 96x48 multiplied by a 48x48. These are not really large enough to warrant launching a full grid. But if I could offload that from the CPU to the GPU, this could be immensely helpful.

As we’ve recently ported our code from 32 bit to 64 bits and run Linux, we should even have UVA zero copy memory available so we can pass the matrices without explicit memory copies.

Yes, I’m implying that, as long as the kernel launches emanate from the same process, i.e. the same application. In modern versions of CUDA (post 4.0) multithreading does not imply multi-context. It’s really nothing other than a concurrent kernel scenario combined with a persistent kernel scenario.