How to limit number of CUDA Cores

Hi all,

I am using NVIDIA GPUs and CUDA for research purposes at our university.
The problem is that I need to limit the number of CUDA cores used by my application.

Is it possible?

Any comments are appreciated!

Kind regards,

Launch a low number of thread blocks, and keep these blocks persistent until your entire computation finishes.

You will be running on only a subset of the available multiprocessors on the device (one block per multiprocessor).

Keep the number of threads per block quite high so that each of the multiprocessors has enough pending warps to hide memory access latencies.
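
Something like this (untested sketch; the kernel body and names are just placeholder work for illustration):

```cpp
// Grid-stride loop: the few blocks you launch keep iterating until all
// n elements are processed, so a small grid still covers the whole problem.
__global__ void persistentKernel(const float *in, float *out, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        out[i] = 2.0f * in[i];              // placeholder work
}

// Host side: launching only 2 blocks of 256 threads means at most
// 2 multiprocessors ever receive work, regardless of the problem size.
persistentKernel<<<2, 256>>>(d_in, d_out, n);
```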

Thanks for answering, cbuchner1.

Following this example: my GPU has four multiprocessors.

But if I launch four blocks, is one guaranteed to be scheduled on each multiprocessor?

Isn't it possible that they get scheduled on only the first two multiprocessors (two blocks each)?

Yours sincerely.

While the details of CUDA's work scheduling algorithm remain undocumented (and differ slightly between device generations), I believe you should see just one block assigned to each multiprocessor.

There is a way to make absolutely sure that there’s just one block per multiprocessor.

Reserve an amount of shared memory per block (or of any other limited resource, such as registers) so that, given the hardware's capabilities, no more than one block can execute on a multiprocessor.
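
For example (untested sketch; assumes a device with 48 KB of shared memory per multiprocessor, so adjust the reservation for your hardware):

```cpp
__global__ void oneBlockPerSM(float *data, int n)
{
    extern __shared__ char pad[];           // reserved at launch time, never actually used
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        data[i] *= 2.0f;                    // placeholder work
}

// Requesting more than half of the per-SM shared memory (here 32 KB out of 48 KB)
// means the hardware cannot fit a second block on the same multiprocessor.
oneBlockPerSM<<<4, 256, 32 * 1024>>>(d_data, n);
```

With that reservation in place, a four-block launch has to land on four different multiprocessors, since no multiprocessor can hold more than one of these blocks.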

Again, thanks for your reply.

Great idea to limit resources.

Yours sincerely.

If you think carefully about the use of smid, you can do even stranger stuff like load up N blocks on a single SM while all the other SMs are empty. In a nutshell:

  • Launch blocks of a single thread each. The number of blocks to launch is the max blocks per multiprocessor times the number of multiprocessors. (Or, alternatively, you can experiment with N blocks times the number of multiprocessors if you want to load up N blocks, but there may be some variability in that case.)

  • Each block (thread) reads its smid. If the smid is the desired one (let’s say 0), then the thread code spins forever (or, say, 30 minutes, or however long you need) using clock64(). If the smid is not the desired one, then the thread code spins for 1 second and exits.

After about a second, you should have one SM with a full complement of blocks (so no new blocks can launch to it), and the other SMs empty. Be sure to launch the above kernel in a particular non-default stream. The app that launched that kernel can then launch other kernels which should run concurrently on the remaining SMs, assuming you launch them to separate streams.
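
Something along these lines (untested sketch; the names and timing constants are made up for illustration, and get_smid() uses the same %smid special register described in the link at the end of this post):

```cpp
__device__ unsigned int get_smid()
{
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));   // id of the multiprocessor this block runs on
    return smid;
}

// Blocks on the target SM spin for a long time; blocks elsewhere spin briefly and exit.
__global__ void occupySM(unsigned int targetSM, long long holdCycles, long long exitCycles)
{
    long long start = clock64();
    long long limit = (get_smid() == targetSM) ? holdCycles : exitCycles;
    while (clock64() - start < limit)
        ;                                      // busy-wait
}

// Host side (error checking omitted):
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

int maxBlocksPerSM = 16;                       // resident-block limit for your architecture
int nBlocks = maxBlocksPerSM * prop.multiProcessorCount;

cudaStream_t occupyStream;
cudaStreamCreate(&occupyStream);

long long cyclesPerSecond = (long long)prop.clockRate * 1000;       // clockRate is in kHz
occupySM<<<nBlocks, 1, 0, occupyStream>>>(0,                        // pin SM 0
                                          30LL * 60 * cyclesPerSecond,  // hold ~30 minutes
                                          cyclesPerSecond);             // others exit after ~1 s
// Work kernels launched afterwards into other streams should land on the remaining SMs.
```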

When you’re done with everything, you can just exit the process. That will kill the “persistent” kernel. Or you can explicitly call cudaDeviceReset().

It’s straightforward to extend the above technique to two or more SMs.

Don’t try this from separate processes. Activity from separate processes requires a context switch. I don’t think the context switch can occur while a “persistent” kernel is running.

How to read smid? Try this:

[url]https://devtalk.nvidia.com/default/topic/518634/execution-id/?offset=7[/url]


Hi TxBob.

Are you implying that in a multithreaded application I can launch a permanently resident “worker block” on a single SMX of my GPU, while other threads can continue to use the remainder of the GPU - all that without expensive context switches?

If that is so, I have a use for this feature already. I have some smallish matrix multiplications to compute, say a 96x48 multiplied by a 48x48. These are not really large enough to warrant launching a full grid. But if I could offload that from the CPU to the GPU, this could be immensely helpful.

As we’ve recently ported our code from 32 bit to 64 bits and run Linux, we should even have UVA zero copy memory available so we can pass the matrices without explicit memory copies.

Yes, I’m implying that, as long as the kernel launches emanate from the same process, i.e. the same application. In modern versions of CUDA (post 4.0) multithreading does not imply multi-context. It’s really nothing other than a concurrent kernel scenario combined with a persistent kernel scenario.