Putting multiprocessors in groups

Has anyone thought about the possibility of putting multiprocessors into groups and running a different task on each group of cores? This would mean running the applications (and so their kernels) concurrently. For example, each executable would pick only 16 cores out of 240 to run on, so we could run several at once.

Context switching may help utilize the GPU better, but as far as I know, only one kernel from one task can run in each time slice, no matter how many cores you have.

Not sure if my idea makes sense under the current CUDA architecture. In the next generation things may be possible??? http://www.nvidia.com/object/fermi_architecture.html

Yes, NVIDIA did - Fermi is supposed to be able to do exactly this. But the current generation of CUDA-capable GPUs cannot.

yup, I am looking at the Fermi spec at the moment. But what I see is that

  1. the concurrent kernels must come from the same application (the same CUDA context) only.
  2. blocks are scheduled onto SMs by the chip-level scheduler, not under user control.

“On the Fermi architecture,
different kernels of the same CUDA context can execute concurrently, allowing maximum
utilization of GPU resources. Kernels from different application contexts can still run
sequentially with great efficiency thanks to the improved context switching performance.”

Concurrent kernel execution only benefits applications whose kernels can be launched in parallel.
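To make that concrete, here is a minimal sketch of what "kernels launched in parallel" means on the host side: independent kernels placed into different non-default streams, which Fermi can overlap (earlier GPUs serialize them). `kernelA`, `kernelB`, and the sizes are illustrative assumptions, not code from this thread.

```cuda
// Hypothetical independent kernels from the SAME application/context
__global__ void kernelA(float *a) { a[threadIdx.x] += 1.0f; }
__global__ void kernelB(float *b) { b[threadIdx.x] *= 2.0f; }

int main()
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    float *a, *b;
    cudaMalloc(&a, 256 * sizeof(float));
    cudaMalloc(&b, 256 * sizeof(float));

    // Kernels in different non-default streams may execute concurrently
    // on Fermi (compute capability 2.x); on pre-Fermi GPUs they run
    // one after the other.
    kernelA<<<1, 256, 0, s0>>>(a);
    kernelB<<<1, 256, 0, s1>>>(b);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```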

I do that in my projects, at the warp level and SM level:
you have to write different code paths and select which path to execute depending on your thread number :-)

For example, at the warp level, I use one warp to do the global-memory access: writing back results from shared memory and prefetching task info from global memory into shared memory (maintaining two FIFOs in shared memory), while the other warps just compute using registers and shared memory.
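A minimal sketch of that warp-specialization pattern, assuming a single shared-memory FIFO and a placeholder computation (the names `fifo`, `g_tasks`, and the FIFO depth are illustrative, not the poster's actual code):

```cuda
#define FIFO_DEPTH 32

__global__ void warpSpecialized(const float *g_tasks, float *g_out, int n)
{
    __shared__ float fifo[FIFO_DEPTH];
    const int warpId = threadIdx.x / warpSize;
    const int lane   = threadIdx.x % warpSize;

    if (warpId == 0) {
        // I/O warp: prefetch task data from global memory into shared
        for (int i = lane; i < FIFO_DEPTH && i < n; i += warpSize)
            fifo[i] = g_tasks[i];
    }
    __syncthreads();

    if (warpId > 0) {
        // Compute warps: touch only registers and shared memory
        int idx = (warpId - 1) * warpSize + lane;
        if (idx < FIFO_DEPTH && idx < n)
            g_out[idx] = fifo[idx] * 2.0f;  // placeholder computation
    }
}
```

A real double-FIFO version would alternate buffers and synchronize per batch; this only shows how thread/warp IDs select the code path.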

At the kernel level, I sometimes dedicate one SM to organizational tasks: FIFO management and even MAPPED PINNED MEMORY exchange, to enable real-time communication with the host CPU :-)
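The mapped pinned ("zero-copy") memory part can be sketched like this: the host allocates page-locked memory that the GPU can read and write directly, so a running kernel can signal the CPU. The flag variable here is an illustrative example, not the poster's setup.

```cuda
#include <cstdio>

__global__ void worker(volatile int *flag)
{
    // The device writes straight into host memory through the mapping.
    if (blockIdx.x == 0 && threadIdx.x == 0)
        *flag = 1;
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);  // must precede context creation

    int *h_flag, *d_flag;
    cudaHostAlloc(&h_flag, sizeof(int), cudaHostAllocMapped);
    *h_flag = 0;
    cudaHostGetDevicePointer(&d_flag, h_flag, 0);

    worker<<<1, 32>>>(d_flag);
    cudaDeviceSynchronize();
    printf("flag = %d\n", *h_flag);

    cudaFreeHost(h_flag);
    return 0;
}
```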

The main problem is using variables with the same name, to be sure registers won't be allocated separately for each task (too many registers). So I name my registers rxx (r00 … r99) and map them to the needed variables manually with #define. Not really the simplest way, but really efficient.
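A small sketch of that manual register-mapping trick, assuming two tasks in one kernel sharing a pool of named registers via #define (all names are illustrative):

```cuda
__global__ void twoTasks(float *out)
{
    float r00, r01;   // shared register pool for both tasks

    // Task A's view of the pool
    #define accA   r00
    #define scaleA r01
    scaleA = 2.0f;
    accA   = scaleA * threadIdx.x;
    #undef accA
    #undef scaleA

    // Task B reuses the SAME registers under new names,
    // so the register count does not double
    #define sumB  r00
    #define stepB r01
    stepB = 0.5f;
    sumB += stepB;
    #undef sumB
    #undef stepB

    out[threadIdx.x] = r00;
}
```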

Thanks for the reply, iPAX. I guess you are talking about a single problem set launched at the same time. What I am talking about here is the application level, e.g. an Excel XLL calling a CUDA Monte Carlo application while another C++ CUDA app does some visual simulation.

Fermi looks powerful, but it still cannot run kernels from different applications concurrently.

It is intended for HPC workloads. Can you really foresee a situation where you would want multiple HPC applications on the device simultaneously?

No, but I can easily foresee a situation where it is not running any HPC workload at all. Like here in my bedroom studio …