Has anyone thought about the possibility of putting the multiprocessors into groups and running a different task on each group of cores? This would mean running several applications (and so their kernels) concurrently. For example, each executable would use only 16 of the 240 cores, so several could run at the same time.
Context switching may help to utilize the GPU better, but as far as I know only one kernel from one task can run in each time slice, no matter how many cores you have.
I'm not sure whether my idea makes sense under the current CUDA architecture. Might it become possible in the next generation? http://www.nvidia.com/object/fermi_architecture.html
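
To make the idea concrete, here is a rough sketch of what I have in mind, written with CUDA streams. The kernel `small_task`, the host function `launch_groups`, and the sizes are hypothetical names and values made up for illustration. Each small kernel is launched into its own stream, so the hardware is allowed to overlap them if it can. This is only a sketch under that assumption: streams do not pin a kernel to a particular group of multiprocessors, and on hardware without concurrent kernel execution the kernels simply run one after another.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for one small application: scales an array in place.
__global__ void small_task(float *data, int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= scale;
}

// Launches several independent small kernels, one per stream.
// Returns 0 if all results are correct, -1 on a CUDA error,
// otherwise the number of wrong elements.
int launch_groups()
{
    const int kNumTasks = 4;   // number of independent tasks (assumption)
    const int kN = 1024;       // elements per task (assumption)
    cudaStream_t streams[kNumTasks];
    float *d_buf[kNumTasks];
    static float h_buf[kN];

    for (int i = 0; i < kN; ++i) h_buf[i] = 1.0f;

    // One stream and one device buffer per task.
    for (int t = 0; t < kNumTasks; ++t) {
        if (cudaStreamCreate(&streams[t]) != cudaSuccess) return -1;
        if (cudaMalloc((void **)&d_buf[t], kN * sizeof(float)) != cudaSuccess) return -1;
        if (cudaMemcpy(d_buf[t], h_buf, kN * sizeof(float),
                       cudaMemcpyHostToDevice) != cudaSuccess) return -1;
    }

    // Each launch uses only a few blocks, leaving most of the GPU free.
    // Kernels in different streams may overlap if the hardware supports it.
    for (int t = 0; t < kNumTasks; ++t) {
        small_task<<<4, 256, 0, streams[t]>>>(d_buf[t], kN, (float)(t + 1));
    }
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;

    // Task t scaled its array of ones by (t + 1); check every element.
    int errors = 0;
    for (int t = 0; t < kNumTasks; ++t) {
        if (cudaMemcpy(h_buf, d_buf[t], kN * sizeof(float),
                       cudaMemcpyDeviceToHost) != cudaSuccess) return -1;
        for (int i = 0; i < kN; ++i) {
            if (h_buf[i] != (float)(t + 1)) ++errors;
        }
        cudaFree(d_buf[t]);
        cudaStreamDestroy(streams[t]);
    }
    return errors;
}
```

Calling `launch_groups()` from a `main` and compiling with `nvcc` should be enough to try it. Whether the four kernels really run side by side would have to be checked with a profiler or timing, since the code itself behaves the same either way.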