Multiple simultaneous kernels would be very useful, especially if we could access some sort of scheduling / resource allocation interface (such as allocating a number of multiprocessors to a given kernel, or to the rendering side).
I’m pretty sure that right now all CUDA calls are blocking. So it is not possible to launch multiple kernels simultaneously, and checking for completion of a block of work is as simple as waiting for the program to get beyond the kernel call…
What I would find useful is being able to upload/download data while a kernel is running on the GPU to hide I/O costs.
I also have a definite need to launch lots of kernels at once. I don’t see a need for the ability to manually partition my hardware resources (i.e. the number of multiprocessors given to a kernel). The reason I need this is that the number of threads per block is currently a static number. I have one algorithm I’m working with that could really benefit from a varying number of threads per block. I can see two options that seem easily implementable:
1. Allow us to pass an expanded grid to the kernel call, where for each block in the grid we manually specify the thread configuration for that block.
2. Make kernel calls optionally asynchronous, so I can issue a series of 1-block calls with the thread dimensions set to my liking, then supply a high-level synchronization primitive of some sort. It could be as simple as a cudaWaitForAllKernelsToComplete() call, or a little fancier with a barrier system.
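Roughly what I have in mind for option 2, as a sketch (cudaWaitForAllKernelsToComplete() is just a name I made up, not an existing call, and the per-block thread counts and work are invented):

```cpp
// Hypothetical usage sketch for option 2: asynchronous 1-block launches with
// per-block thread counts, followed by one host-side synchronization.
__global__ void processBlock(float *data, int blockId)
{
    // Placeholder per-block work.
    data[blockId * blockDim.x + threadIdx.x] += 1.0f;
}

void launchAllBlocks(float *d_data, const int *threadsPerBlock, int numBlocks)
{
    for (int b = 0; b < numBlocks; ++b) {
        // Each launch covers exactly one block, with its own thread count,
        // and (in this proposal) returns immediately instead of blocking.
        processBlock<<<1, threadsPerBlock[b]>>>(d_data, b);
    }
    // Proposed high-level barrier: block the host until every kernel
    // launched above has finished on the device.
    cudaWaitForAllKernelsToComplete();   // hypothetical, not a real API call
}
```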
My only option at the current time is to set the thread configuration to the maximum number of threads required by ANY block in the grid. I can then use some conditional logic to let any unnecessary threads “die,” but this feels kludgy and wasteful, plus I lose the ability to use __syncthreads() as easily.
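For concreteness, the workaround I’m describing looks something like this sketch (activeCounts[] and the per-thread work are just placeholders):

```cpp
// Sketch of the current workaround: every block is launched with the maximum
// thread count, and surplus threads exit early.
__global__ void paddedKernel(const int *activeCounts, float *data)
{
    // Threads beyond this block's real workload "die" immediately.
    // Note: once some threads have returned, calling __syncthreads()
    // later in the block is no longer safe - exactly the limitation
    // complained about above.
    if (threadIdx.x >= activeCounts[blockIdx.x])
        return;

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] *= 2.0f;   // placeholder per-thread work
}

// Host side: launch with the largest thread count needed by ANY block, e.g.
// paddedKernel<<<numBlocks, maxThreadsPerBlock>>>(d_activeCounts, d_data);
```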
Okay, I have run some tests and it seems that you cannot have multiple kernels in flight even if they are spawned by different host threads. Basically I wrote a kernel with one thread that sits in a loop for about 5 seconds then returns. I then spawned multiple host threads that all called this kernel. The execution time of the host program increased linearly with the number of times the kernel was called: 5 seconds for 1 host thread, 10 seconds for 2 threads, 15 seconds for 3, etc.
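For reference, the test was along these lines (a sketch, not the exact code; the loop bounds are whatever makes a single run take about 5 seconds on the card):

```cpp
// Single-thread kernel that just burns time in a dummy arithmetic loop.
// The volatile accumulator keeps the compiler from optimising the loop away.
__global__ void spinKernel(int outer, int inner, int *dummy)
{
    volatile int acc = 0;
    for (int i = 0; i < outer; ++i)
        for (int j = 0; j < inner; ++j)
            acc += 1;
    *dummy = acc;   // give the loop a visible side effect
}

// Each host thread runs something like this; with the current driver the
// launches end up serialized, so N host threads take roughly N * 5 seconds.
void runOnce(int outer, int inner, int *d_dummy)
{
    spinKernel<<<1, 1>>>(outer, inner, d_dummy);
    cudaThreadSynchronize();   // block until this kernel has finished
}
```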
It also seems that this is a limitation of the driver or underlying architecture rather than CUDA, since launching a kernel causes the display to become “unresponsive” until it returns.
A third result that I have obtained is that you can allocate/deallocate memory from different host threads. So one host thread can potentially allocate all of the device memory, or deallocate memory that was allocated by another thread.
Considering all the very scarce resources threads have to share (registers, shared memory, etc.), I can foresee trouble when trying to manage those dynamically as other kernels start and stop processing their own data, especially because there is no way to ensure any kind of synchronisation between different kernels.
Do not forget that the memory limit on the GPU is a hard one. That is, unlike on the CPU, where allocating too much memory leads to swapping to disk and a performance drop, on the GPU you will probably just get a crash…
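At the very least, check the return code of cudaMalloc instead of assuming it succeeds; a small sketch:

```cpp
#include <cstdio>

// Minimal sketch: always check device allocations, since there is no
// swapping to fall back on when device memory runs out.
float *allocateOrWarn(size_t bytes)
{
    float *ptr = 0;
    cudaError_t err = cudaMalloc((void **)&ptr, bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc of %lu bytes failed: %s\n",
                (unsigned long)bytes, cudaGetErrorString(err));
        return 0;   // caller decides how to recover (smaller working set, etc.)
    }
    return ptr;
}
```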
I think it is probably better for now to use different GPUs, with the PLEX solution for example.
Maybe in the future it will be possible to have the functionality to partition the multiprocessor pool.
But for me, I would not like to see more burden put on the GPU subsystem to manage all those potential conflicts if it means reducing performance.
I read a post from Nvidia a couple of days ago saying that async launch & concurrent copy would be coming out in the next release in May (can’t find it now?)
Question: will the async launch have another parameter - # multiprocessors - so that multiple kernels can be launched concurrently?
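If it does arrive roughly as announced, I would expect host-side usage along these lines (purely speculative - the stream handles and the async copy call are assumptions on my part, nothing here is confirmed API, and the host buffer would presumably need to be page-locked for the copy to really overlap):

```cpp
// Speculative sketch of overlapping a kernel with a host->device copy,
// assuming the announced release exposes stream handles and an async memcpy.
__global__ void myKernel(float *data)
{
    // Placeholder work on the current batch.
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}

void overlappedStep(float *d_work, float *d_next, const float *h_next,
                    size_t bytes)
{
    cudaStream_t computeStream, copyStream;
    cudaStreamCreate(&computeStream);
    cudaStreamCreate(&copyStream);

    // Kernel for batch N runs in one stream...
    myKernel<<<64, 256, 0, computeStream>>>(d_work);

    // ...while the data for batch N+1 is uploaded in another stream.
    cudaMemcpyAsync(d_next, h_next, bytes, cudaMemcpyHostToDevice, copyStream);

    // Wait for both before swapping buffers for the next iteration.
    cudaStreamSynchronize(computeStream);
    cudaStreamSynchronize(copyStream);

    cudaStreamDestroy(computeStream);
    cudaStreamDestroy(copyStream);
}
```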
I have a requirement for multiple concurrent kernels, as I cannot use more than about 128 SIMD threads at a time (I then collect the results, select the best one and go around again). Since execution time can vary by a factor of 8 between runs, I cannot efficiently load up a G80 unless I can utilise multiprocessors (or groups of them) separately.
The neatest solution for me would be to run concurrent contexts on subsets of multiprocessors, but that is not supported. My understanding of the architecture suggests this should work quite efficiently.
Because the individual kernels will vary a lot in execution time, I would need to be able to signal the host upon completion and then go to sleep somehow (tell the board exec?) so that the results can be copied out and the next lot loaded in, and then have the host release the block(s). I cannot see a way to put a block to sleep without burning resources and interfering with other blocks that have real work to do. Spinning on constant memory looked promising until I found that one cannot twiddle it from the host side mid-context. Then there is still the problem of completion notifications… (polling global memory would work, but again wastes resources). Still grappling - any ideas?
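For what it’s worth, the global-memory polling idea I’m dismissing as wasteful would look something like this (the done/release flag arrays are invented for the example, and I’m not even sure a host write to release[] becomes visible to a running kernel on the current driver):

```cpp
// Sketch of the (wasteful) global-memory polling idea: each block signals
// completion, then spins until the host releases it.
__global__ void pollingKernel(volatile int *done, volatile int *release,
                              float *results)
{
    // ... the block's real work goes here, writing into results ...

    __syncthreads();
    if (threadIdx.x == 0) {
        done[blockIdx.x] = 1;              // tell the host this block is finished
        while (release[blockIdx.x] == 0)   // burn the multiprocessor waiting for
            ;                              // the host to copy results and set the flag
    }
    __syncthreads();

    // ... next round of work after release ...
}
```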
Point 17 of the FAQ says that it is possible to run multiple CUDA applications at the same time. Point 18 says it is not possible to run multiple kernels at the same time (as validated by other members of this forum).
What is the difference between “CUDA applications” and “CUDA kernels”? I don’t get it.
A kernel is a function running on the GPU. A CUDA application is a program which holds a CUDA context so it can launch kernels. This way, two CUDA applications can be running concurrently in the system, but only one kernel will be running on the GPU at any given time.
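To make the distinction concrete, here is a minimal sketch of a complete CUDA application: the process gets its context implicitly on the first runtime call, and addOne is the kernel that actually runs on the GPU.

```cpp
#include <cstdio>

// The kernel: the only code here that executes on the GPU.
__global__ void addOne(int *x) { *x += 1; }

// The CUDA application: a host program holding a context (created implicitly
// by the runtime on the first CUDA call) from which kernels are launched.
int main()
{
    int *d_x = 0, h_x = 41;
    cudaMalloc((void **)&d_x, sizeof(int));   // first call: context is created
    cudaMemcpy(d_x, &h_x, sizeof(int), cudaMemcpyHostToDevice);
    addOne<<<1, 1>>>(d_x);                    // launch the kernel
    cudaMemcpy(&h_x, d_x, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    printf("%d\n", h_x);                      // prints 42
    return 0;
}
```

Two such applications can run at the same time on one system, but their kernels will still execute one after another on the GPU.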
It’s something we’re considering, but there are a lot of bigger fish to fry in the meantime, and enabling this would have ramifications on the API and programming model, so we have to move carefully.
I doubt anyone is willing to comment on the nature of the “bigger fish” which are slated to be fried. :)
However, I believe that a kind of “best practices for CUDA programming” document is needed. Although this doc should not disclose the fish, it would still ensure maximal code reusability across future SDK releases.