Has anyone thought about the possibility of grouping the multiprocessors and running a different task on each group of cores? That would mean running the applications (and therefore their kernels) concurrently. For example, an executable would pick only 16 cores out of 240, so several could run at the same time.
Context switching may help utilize the GPU better, but as far as I know only one kernel from one task can run in each time slice, no matter how many cores you have.
Yup, I am looking at the Fermi spec at the moment. But what I see is that:
- concurrent kernels must come from the same application only;
- blocks are assigned to SMs by the chip-level scheduler, not under user control.
“On the Fermi architecture,
different kernels of the same CUDA context can execute concurrently, allowing maximum
utilization of GPU resources. Kernels from different application contexts can still run
sequentially with great efficiency thanks to the improved context switching performance.”
Concurrent kernel execution only benefits applications whose kernels can be launched in parallel.
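To illustrate what "kernels that can be launched in parallel" means in practice: within one CUDA context, independent kernels submitted to different non-default streams have no implicit ordering, so on Fermi the hardware may overlap them. A minimal sketch (kernelA/kernelB and the sizes are just placeholders):

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(float *a) { a[threadIdx.x] += 1.0f; }
__global__ void kernelB(float *b) { b[threadIdx.x] *= 2.0f; }

int main()
{
    float *dA, *dB;
    cudaMalloc(&dA, 32 * sizeof(float));
    cudaMalloc(&dB, 32 * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Kernels in different non-default streams are independent;
    // on Fermi the scheduler is free to run them concurrently.
    kernelA<<<1, 32, 0, s0>>>(dA);
    kernelB<<<1, 32, 0, s1>>>(dB);

    cudaDeviceSynchronize();

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(dA);
    cudaFree(dB);
    return 0;
}
```

Note that both kernels still belong to the same context, which is exactly the limitation discussed above.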
I do that in my projects, at warp level and SM level:
you have to write different code paths and select which one to execute depending on your thread number :-)
For example at warp level, I use one warp for global-memory access: writing back results from shared memory and prefetching task info from global memory into shared (maintaining two FIFOs in shared memory), while the other warps just compute using registers and shared memory.
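A simplified sketch of that warp-specialization idea, assuming a single staging buffer instead of the two-FIFO scheme described above (TILE, NWARPS and process() are illustrative names, not from the original code):

```cuda
#include <cuda_runtime.h>

#define TILE   256
#define NWARPS 8   // assumes blockDim.x == NWARPS * 32

__device__ float process(float v) { return v * v; }  // placeholder compute

__global__ void warpSpecialized(const float *in, float *out, int ntiles)
{
    __shared__ float buf[TILE];
    int warp = threadIdx.x / 32;
    int lane = threadIdx.x % 32;

    for (int t = 0; t < ntiles; ++t) {
        if (warp == 0) {
            // memory warp: stage the next tile into shared memory
            for (int i = lane; i < TILE; i += 32)
                buf[i] = in[t * TILE + i];
        }
        __syncthreads();
        if (warp != 0) {
            // compute warps: work only on registers and shared memory
            for (int i = (warp - 1) * 32 + lane; i < TILE;
                 i += (NWARPS - 1) * 32)
                out[t * TILE + i] = process(buf[i]);
        }
        __syncthreads();
    }
}
```

The branch on the warp index is the "select which path you execute depending on your thread number" part; since it is uniform within a warp, it causes no divergence.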
At kernel level, I sometimes dedicate one SM to organization tasks: FIFO management, and even mapped pinned memory exchange to enable real-time communication with the host CPU :-)
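The mapped pinned memory part can be sketched as follows: the host allocates zero-copy memory with cudaHostAlloc, gets a device pointer to it, and a running kernel polls it, so host writes become visible without a memcpy. This is a minimal signalling example, not the FIFO machinery described above (names are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void waitForHost(volatile int *flag, int *result)
{
    // one "organization" thread polls the zero-copy flag
    if (threadIdx.x == 0) {
        while (*flag == 0) { /* spin until the host signals */ }
        *result = 1;
    }
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);

    volatile int *hFlag;
    cudaHostAlloc((void **)&hFlag, sizeof(int), cudaHostAllocMapped);
    *hFlag = 0;

    int *dFlag, *dResult;
    cudaHostGetDevicePointer((void **)&dFlag, (void *)hFlag, 0);
    cudaMalloc(&dResult, sizeof(int));

    waitForHost<<<1, 32>>>((volatile int *)dFlag, dResult);
    *hFlag = 1;               // visible to the still-running kernel
    cudaDeviceSynchronize();

    cudaFree(dResult);
    cudaFreeHost((void *)hFlag);
    return 0;
}
```

Busy-waiting like this ties up the block, which is why dedicating one SM (one resident block) to it, as described above, keeps the rest of the chip free for computing.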
The main problem is making the tasks use variables with the same names, to be sure separate registers won't be allocated for each task (too many registers). So I name my registers rxx (r00 … r99) and map the needed variables onto them manually with #define. Not really the simplest way, but really efficient.
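A sketch of that manual register-mapping trick, assuming two tasks sharing one pool of named registers (the task bodies and names are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void multitask(float *data)
{
    float r00, r01;   // shared register pool, reused by every task

    if (threadIdx.x < 32) {
        // task A: alias the pool to meaningful names
        #define accum r00
        #define scale r01
        scale = 2.0f;
        accum = scale * data[threadIdx.x];
        data[threadIdx.x] = accum;
        #undef accum
        #undef scale
    } else {
        // task B: reuse the same variables under different names,
        // so the compiler allocates one register set, not two
        #define left  r00
        #define right r01
        left  = data[threadIdx.x];
        right = data[threadIdx.x - 32];
        data[threadIdx.x] = left + right;
        #undef left
        #undef right
    }
}
```

Since both branches touch only r00/r01, the register pressure stays that of one task rather than the sum of all tasks.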
Thanks for the reply, iPAX. I guess you are talking about a single problem set launched at the same time. What I am talking about here is the application level, e.g. an Excel XLL calling a CUDA Monte Carlo application while another C++ CUDA app does some visual simulation.
Fermi looks powerful, but it still cannot run kernels from different applications concurrently.