Multiple kernels in flight?

Is it currently possible to launch multiple kernels at once, or is only one kernel active at any one time?

I’d like to fire off multiple blocks of independent kernels and wait for them all to complete.

Is there any underlying workload management?

Furthermore, what is the most efficient way of checking for completion of a block of work?

Multiple simultaneous kernels would be very useful, especially if we could access some sort of scheduling / resource allocation interface (such as allocating a number of multiprocessors to a given kernel, or to the rendering side).

I’m pretty sure that right now all CUDA calls are blocking. So it is not possible to launch multiple kernels simultaneously, and checking for completion of a block of work is as simple as waiting for the program to get beyond the kernel call…

What I would find useful is being able to upload/download data while a kernel is running on the GPU to hide I/O costs.
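For what it’s worth, here is a hedged sketch of what such copy/compute overlap might look like if an asynchronous copy API were exposed. The `cudaMemcpyAsync` / `cudaStream_t` names and the variable names (`d_cur`, `d_next`, `h_next`, `process`) are guesses for illustration, not the current interface:

```cuda
// Hypothetical overlap of upload and compute, assuming a future
// stream-based async copy API (names are assumptions):
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

// Process the current chunk while the next chunk uploads in parallel.
process<<<grid, block, 0, s0>>>(d_cur, n);         // compute chunk k
cudaMemcpyAsync(d_next, h_next, bytes,             // upload chunk k+1,
                cudaMemcpyHostToDevice, s1);       // overlapped with kernel

cudaStreamSynchronize(s0);   // wait for the kernel
cudaStreamSynchronize(s1);   // wait for the upload
```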

I also have a definite need to launch lots of kernels at once. I don’t see a need for the ability to manually partition my hardware resources (i.e. the number of multiprocessors given to a kernel). The reason I need this is that the number of threads per block is currently a static number. I have one algorithm I’m working with that could really benefit from a varying number of threads per block. I can see two situations that are easily implementable.

  1. Allow us to pass an expanded grid to the kernel call, where for each block in the grid we manually specify that block's thread configuration
  2. Make kernel calls be optionally asynchronous, so I can issue a series of 1-block calls with the thread dimensions set to my liking - then supply a high-level synchronization primitive of some sort. It could be as simple as a cudaWaitForAllKernelsToComplete() call or a little fancier with a barrier system.

My only option at the current time is to set the thread configuration to the maximum number of threads required by ANY block in the grid. I can then use some conditional logic to let any unnecessary threads “die,” but this feels kludgy and wasteful, plus I lose the ability to __syncthreads() as easily.
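The padding workaround described above might look roughly like this (the `threadsNeeded` table, `MAX_BLOCKS`, and the kernel body are hypothetical). Note that surplus threads cannot simply return early if the block uses __syncthreads(), since every thread must reach the barrier:

```cuda
#define MAX_BLOCKS 256

// Hypothetical per-block thread-count table, filled in by the host.
__constant__ int threadsNeeded[MAX_BLOCKS];

// Sketch of the "pad to the max" workaround: every block is launched
// with the maximum thread count any block needs; surplus threads idle.
__global__ void paddedKernel(float *data)
{
    bool isActive = (threadIdx.x < threadsNeeded[blockIdx.x]);

    if (isActive) {
        // ... real per-thread work on data ...
    }

    // Every thread, active or not, must participate in the barrier,
    // so the early-exit has to be guarded work, not a return.
    __syncthreads();

    if (isActive) {
        // ... second phase after the barrier ...
    }
}
```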

Has anyone tried using multiple host threads, each of which makes a device function call?

Okay, I have run some tests and it seems that you cannot have multiple kernels in flight even if they are spawned by different host threads. Basically I wrote a kernel with one thread that sits in a loop for about 5 seconds then returns. I then spawned multiple host threads that all called this kernel. The execution time of the host program increased linearly with the number of times the kernel was called: 5 seconds for 1 host thread, 10 seconds for 2 threads, 15 seconds for 3, etc…
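A sketch of that test for anyone who wants to reproduce it (the spin count is machine-dependent, and `spinKernel`/`launcher` are just illustrative names):

```cuda
#include <pthread.h>
#include <stdio.h>

// Busy-wait kernel: a single thread spins, then writes a result so
// the compiler cannot optimize the loop away.
__global__ void spinKernel(int iters, int *sink)
{
    int x = 0;
    for (int i = 0; i < iters; ++i)
        x += i;
    *sink = x;
}

// Each host thread launches the kernel and waits for it to finish.
void *launcher(void *arg)
{
    int *d_sink;
    cudaMalloc((void **)&d_sink, sizeof(int));
    spinKernel<<<1, 1>>>(1 << 30, d_sink);   // several seconds of spinning
    cudaThreadSynchronize();                 // block until the kernel returns
    cudaFree(d_sink);
    return 0;
}

int main(void)
{
    const int N = 3;
    pthread_t t[N];
    for (int i = 0; i < N; ++i)
        pthread_create(&t[i], 0, launcher, 0);
    for (int i = 0; i < N; ++i)
        pthread_join(t[i], 0);
    // Observation from the post: total wall-clock time grows linearly
    // with N, i.e. the kernels serialize on the device.
    return 0;
}
```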

It also seems that this is a limitation of the driver or underlying architecture rather than CUDA, since calling a kernel causes the display to become “unresponsive” until it returns.

A third result that I have obtained is that you can allocate/deallocate memory from different host threads. So one host thread can potentially allocate all of the device memory, or deallocate memory that was allocated by another thread.

I think this is a very dangerous way to go. Considering all the very scarce resources that threads must share (registers, shared memory, etc.), I can foresee trouble when trying to manage them dynamically while other kernels start and stop processing their own data, especially because there is no way to ensure any kind of synchronisation between different kernels.

Do not forget that the memory limitation on the GPU is hard. That is, unlike the CPU, where allocating too much memory leads to swapping to disk and a performance drop, on the GPU you will probably just get a crash…

I think it is probably better for now to use different GPUs, with the PLEX solution for example. Maybe in the future it will be possible to partition the multiprocessor pool.

But personally, I would not like to see more burden put on the GPU subsystem to manage all those potential conflicts if it means reducing performance.

I read a post from Nvidia a couple of days ago saying that async launch & concurrent copy would be coming out in the next release in May (can’t find it now?)

Question: will the async launch have another parameter - # multiprocessors - so that multiples can be launched concurrently?

I have a requirement for multiple concurrent kernels, as I cannot use more than about 128 SIMD threads (I then collect the results, select the best result and go around again). Since execution time can vary by a factor of 8 between runs, I cannot efficiently load up a G80 unless I can utilise multiprocessors (or groups of them) separately.

The neatest solution for me would be to run concurrent contexts on subsets of multiprocessors, but that is not supported. My understanding of the architecture suggests this should work quite efficiently.

All this relies upon kernel completion notification working properly - see http://forums.nvidia.com/index.php?showtopic=28524

Nvidia any comment?

No, the next release won’t allow you to run multiple kernels concurrently.

Couldn’t you replace the two kernels that you want to execute concurrently (say, kernelA() and kernelB()) by a single kernel that does:

kernel()
{
    if (blockIdx.x < someNumber) {
        kernelA();
    } else {
        kernelB();
    }
}

?

Because the individual kernels will vary a lot in execution time, I would need to be able to signal the host upon completion and then go to sleep somehow (tell the board exec?) so that the results can be copied out, the next lot loaded in, and the host can then release the block(s). I cannot see a way to put a block to sleep without burning resources and interfering with other blocks that have real work to do. Spinning on constant memory looked promising until I found one cannot twiddle it from the host side mid-context. Then there is still the problem of completion notifications… (polling global memory would work, but again wastes resources). Still grappling; any ideas?
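For reference, the global-memory polling scheme mentioned above might be sketched like this. It is wasteful, as the post says (host-side busy wait plus repeated copies), and the names (`doneFlags`, `workKernel`, `allDone`) are hypothetical:

```cuda
// Each block writes a completion flag to global memory when it finishes.
__global__ void workKernel(int *doneFlags /*, ... real arguments ... */)
{
    // ... block does its real work ...
    __syncthreads();
    if (threadIdx.x == 0)
        doneFlags[blockIdx.x] = 1;   // signal this block's completion
}

// Host side: poll the flags by copying them back until all blocks report
// done. numBlocks is assumed known; d_flags must be zeroed before launch.
int allDone(int *d_flags, int *h_flags, int numBlocks)
{
    cudaMemcpy(h_flags, d_flags, numBlocks * sizeof(int),
               cudaMemcpyDeviceToHost);
    for (int i = 0; i < numBlocks; ++i)
        if (!h_flags[i])
            return 0;
    return 1;
}
```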

Yes, that would be truly useful, has anyone tried this yet? (execute the kernel in one thread, then do a cudaMemCpy in another)

Yes. Doesn’t work. Returns rubbish or the machine hangs.

Peter

Concurrent copy (with one kernel!) is scheduled for the next release, that is why I am considering how it might be used to solve my problem.
Eric

Point 17 of the FAQ says that it is possible to run multiple CUDA applications at the same time. Point 18 says it is not possible to run multiple kernels at the same time (as validated by other members of this forum).

What is the difference between “CUDA applications” and “CUDA kernels”? I don’t get it.

thanks in advance.
best regards, christoph

A kernel is a function running on the GPU. A CUDA application is a program which holds a CUDA context so it can launch kernels. This way, two CUDA applications can be running concurrently in the system, but there will only ever be one kernel running on the GPU at any given time.

I see. OK, Thanks!

Will there be a possibility to launch multiple kernels in parallel on the device in a future release of CUDA?

Indeed, this is an extremely interesting question.

The reason it should be answered sooner rather than later is that many of us are currently evaluating the possibilities and planning future algorithm architectures, and the ability to run multiple kernels at once will definitely affect those architectures.

Is anyone from NVIDIA able to comment on that? At least a YES/NO will do fine.

How about “MAYBE”?

It’s something we’re considering, but there are a lot bigger fish to fry in the meantime, and enabling this would have ramifications on the API and programming model, so we have to move carefully.

Mark

“Maybe” is a great answer… at least for someone like me who does not plan to use CUDA for rocket science.

That sounds intriguing. Thanks.

I doubt anyone is willing to comment on the nature of the “bigger fish” slated to be fried. :)

However, I believe that a kind of “best practices for CUDA programming” document is needed. Although such a doc should not disclose the fish, it would still ensure maximal code reusability across future SDK releases.