Multiple kernels in flight?

Is it currently possible to launch multiple kernels at once, or is only one kernel active at any one time?

I’d like to fire off multiple blocks of independent kernels and wait for them all to complete.

Is there any underlying workload management?

Furthermore, what is the most efficient way of checking for completion of a block of work?

Multiple simultaneous kernels would be very useful, especially if we could access some sort of scheduling / resource allocation interface (such as allocating a number of multiprocessors to a given kernel, or to the rendering side).

I’m pretty sure that right now all CUDA calls are blocking. So it is not possible to launch multiple kernels simultaneously, and checking for completion of a block of work is as simple as waiting for the program to get beyond the kernel call…

What I would find useful is being able to upload/download data while a kernel is running on the GPU to hide I/O costs.
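For what it’s worth, here is a hedged sketch of what such copy/compute overlap might look like if an asynchronous copy API were exposed. The `cudaMemcpyAsync` / `cudaStream_t` names and the variable names (`d_cur`, `d_next`, `h_next`, `process`) are guesses for illustration, not the current interface:

```cuda
// Hypothetical overlap of upload and compute, assuming a future
// stream-based async copy API (names are assumptions):
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

// Process the current chunk while the next chunk uploads in parallel.
process<<<grid, block, 0, s0>>>(d_cur, n);         // compute chunk k
cudaMemcpyAsync(d_next, h_next, bytes,             // upload chunk k+1,
                cudaMemcpyHostToDevice, s1);       // overlapped with kernel

cudaStreamSynchronize(s0);   // wait for the kernel
cudaStreamSynchronize(s1);   // wait for the upload
```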

I also have a definite need to launch lots of kernels at once. I don’t see a need for the ability to manually partition my hardware resources (i.e. the number of multiprocessors given to a kernel). The reason I need this is that the number of threads per block is currently a static number. I have one algorithm I’m working with that could really benefit from a varying number of threads per block. I can see two situations that are easily implementable.

  1. Allow us to pass an expanded grid to the kernel call, where for each block in the grid we manually specify that block's thread configuration
  2. Make kernel calls be optionally asynchronous, so I can issue a series of 1-block calls with the thread dimensions set to my liking - then supply a high-level synchronization primitive of some sort. It could be as simple as a cudaWaitForAllKernelsToComplete() call or a little fancier with a barrier system.

My only option at the current time is to set the thread configuration to the maximum number of threads required by ANY block in the grid. I can then use some conditional logic to let any unnecessary threads “die,” but this feels kludgy and wasteful, plus I lose the ability to __syncthreads() as easily.
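The padding workaround described above might look roughly like this (the `threadsNeeded` table, `MAX_BLOCKS`, and the kernel body are hypothetical). Note that surplus threads cannot simply return early if the block uses __syncthreads(), since every thread must reach the barrier:

```cuda
#define MAX_BLOCKS 256

// Hypothetical per-block thread-count table, filled in by the host.
__constant__ int threadsNeeded[MAX_BLOCKS];

// Sketch of the "pad to the max" workaround: every block is launched
// with the maximum thread count any block needs; surplus threads idle.
__global__ void paddedKernel(float *data)
{
    bool isActive = (threadIdx.x < threadsNeeded[blockIdx.x]);

    if (isActive) {
        // ... real per-thread work on data ...
    }

    // Every thread, active or not, must participate in the barrier,
    // so the early-exit has to be guarded work, not a return.
    __syncthreads();

    if (isActive) {
        // ... second phase after the barrier ...
    }
}
```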

Has anyone tried using multiple host threads, each of which makes a device function call?

Okay, I have run some tests and it seems that you cannot have multiple kernels in flight even if they are spawned by different host threads. Basically I wrote a kernel with one thread that sits in a loop for about 5 seconds then returns. I then spawned multiple host threads that all called this kernel. The execution time of the host program increased linearly with the number of times the kernel was called: 5 seconds for 1 host thread, 10 seconds for 2 threads, 15 seconds for 3, etc…
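A sketch of that test for anyone who wants to reproduce it (the spin count is machine-dependent, and `spinKernel`/`launcher` are just illustrative names):

```cuda
#include <pthread.h>
#include <stdio.h>

// Busy-wait kernel: a single thread spins, then writes a result so
// the compiler cannot optimize the loop away.
__global__ void spinKernel(int iters, int *sink)
{
    int x = 0;
    for (int i = 0; i < iters; ++i)
        x += i;
    *sink = x;
}

// Each host thread launches the kernel and waits for it to finish.
void *launcher(void *arg)
{
    int *d_sink;
    cudaMalloc((void **)&d_sink, sizeof(int));
    spinKernel<<<1, 1>>>(1 << 30, d_sink);   // several seconds of spinning
    cudaThreadSynchronize();                 // block until the kernel returns
    cudaFree(d_sink);
    return 0;
}

int main(void)
{
    const int N = 3;
    pthread_t t[N];
    for (int i = 0; i < N; ++i)
        pthread_create(&t[i], 0, launcher, 0);
    for (int i = 0; i < N; ++i)
        pthread_join(t[i], 0);
    // Observation from the post: total wall-clock time grows linearly
    // with N, i.e. the kernels serialize on the device.
    return 0;
}
```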

It also seems that this is a limitation of the driver or underlying architecture rather than CUDA, since calling a kernel causes the display to become “unresponsive” until it returns.

A third result that I have obtained is that you can allocate/deallocate memory from different host threads. So one host thread can potentially allocate all of the device memory, or deallocate memory that was allocated by another thread.

I think this is a very dangerous way to go. Considering all the very scarce resources that threads must share (registers, shared memory, etc.), I can foresee trouble when trying to manage them dynamically while other kernels start and stop processing their own data, especially because there is no way to ensure any kind of synchronisation between different kernels.

Do not forget that the memory limitation on the GPU is hard. That is, unlike the CPU, where allocating too much memory leads to swapping to disk and a performance drop, on the GPU you will probably just get a crash…

I think it is probably better for now to use different GPUs, with the PLEX solution for example. Maybe in the future it will be possible to partition the multiprocessor pool.

But personally, I would not like to see more burden put on the GPU subsystem to manage all those potential conflicts if it means reducing performance.

I read a post from Nvidia a couple of days ago saying that async launch & concurrent copy would be coming out in the next release in May (can’t find it now?)

Question: will the async launch have another parameter - # multiprocessors - so that multiples can be launched concurrently?

I have a requirement for multiple concurrent kernels, as I cannot use more than about 128 SIMD threads (I then collect the results, select the best result and go around again). Since execution time can vary by a factor of 8 between runs, I cannot efficiently load up a G80 unless I can utilise multiprocessors (or groups of them) separately.

The neatest solution for me would be to run concurrent contexts on subsets of multiprocessors, but that is not supported. My understanding of the architecture suggests this should work quite efficiently.

All this relies upon kernel completion notification working properly - see http://forums.nvidia.com/index.php?showtopic=28524

Nvidia any comment?

No, the next release won’t allow you to run multiple kernels concurrently.

Couldn’t you replace the two kernels that you want to execute concurrently (say, kernelA() and kernelB()) by a single kernel that does:

kernel()
{
    if (blockIdx.x < someNumber) {
        kernelA();
    } else {
        kernelB();
    }
}

?

Because the individual kernels will vary a lot in execution time, I would need to be able to signal the host upon completion and then go to sleep somehow (tell the board exec?) so that the results can be copied out, the next lot loaded in, and the host can then release the block(s). I cannot see a way to put a block to sleep without burning resources and interfering with other blocks that have real work to do. Spinning on constant memory looked promising until I found one cannot twiddle it from the host side mid-context. Then there is still the problem of completion notifications… (polling global memory would work, but again wastes resources). Still grappling; any ideas?
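For reference, the global-memory polling scheme mentioned above might be sketched like this. It is wasteful, as the post says (host-side busy wait plus repeated copies), and the names (`doneFlags`, `workKernel`, `allDone`) are hypothetical:

```cuda
// Each block writes a completion flag to global memory when it finishes.
__global__ void workKernel(int *doneFlags /*, ... real arguments ... */)
{
    // ... block does its real work ...
    __syncthreads();
    if (threadIdx.x == 0)
        doneFlags[blockIdx.x] = 1;   // signal this block's completion
}

// Host side: poll the flags by copying them back until all blocks report
// done. numBlocks is assumed known; d_flags must be zeroed before launch.
int allDone(int *d_flags, int *h_flags, int numBlocks)
{
    cudaMemcpy(h_flags, d_flags, numBlocks * sizeof(int),
               cudaMemcpyDeviceToHost);
    for (int i = 0; i < numBlocks; ++i)
        if (!h_flags[i])
            return 0;
    return 1;
}
```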

Yes, that would be truly useful, has anyone tried this yet? (execute the kernel in one thread, then do a cudaMemCpy in another)

Yes. Doesn’t work. Returns rubbish or the machine hangs.

Peter

Concurrent copy (with one kernel!) is scheduled for the next release, that is why I am considering how it might be used to solve my problem.
Eric

Point 17 of the FAQ says that it is possible to run multiple CUDA applications at the same time. Point 18 says it is not possible to run multiple kernels at the same time (as validated by other members of this forum).

What is the difference between “CUDA applications” and “CUDA kernels”? I don’t get it.

thanks in advance.
best regards, christoph

A kernel is a function running on the GPU. A CUDA application is a program which holds a CUDA context so it can launch kernels. This way, two CUDA applications can be running concurrently in the system, but there will only ever be one kernel running on the GPU at any given time.

I see. OK, Thanks!

Will there be a possibility to launch multiple kernels in parallel on the device in a future release of CUDA?

Indeed, this is an extremely interesting question.

The reason it should be answered sooner rather than later is that many of us are currently evaluating the possibilities and planning future algorithm architectures, and the ability to run multiple kernels at once will definitely affect those architectures.

Is anyone from NVIDIA able to comment on that? At least a YES/NO will do fine.

How about “MAYBE”?

It’s something we’re considering, but there are a lot bigger fish to fry in the meantime, and enabling this would have ramifications on the API and programming model, so we have to move carefully.

Mark

“Maybe” is a great answer… at least for someone like me who does not plan to use CUDA for rocket science.

That sounds intriguing. Thanks.

I doubt anyone is willing to comment on the nature of the “bigger fish” slated to be fried. :)

However, I believe that a kind of “best practices for CUDA programming” document is needed. Although such a doc should not disclose the fish, it would still ensure maximal code reusability across future SDK releases.