Increased number of concurrent kernels for kepler? How many concurrent kernels can a kepler card lau


Does somebody know whether the tripled CUDA core count of the Kepler cards (formerly 512, now 1536) means that we can launch three times as many concurrent kernels? The Fermi cards can schedule up to 16 kernels concurrently, though I could hardly measure a performance gain above 4 concurrent kernels. Being able to launch 48 (or even 12) independent compute kernels would just be awesome.


Kepler, or more precisely the GK104 of the GTX 680, is organized in 8 SMXs with 192 CUDA cores each (8 × 192 = 1536). From this point of view you may be able to launch up to 8 kernels simultaneously, but these 8 SMXs are grouped in pairs into what NVIDIA calls a GPC (so 4 GPCs on GK104), and I suspect this is the base unit for kernel execution, leaving only 4 slots to execute kernels concurrently on the GK104.

Read this NVIDIA PDF about the GTX 680 Kepler architecture; it’s really interesting.

Nah, I think it’s still 16. There’s no one-kernel-per-SM limitation (never has been).

Thanks for the excellent resources on the new architecture. I will read them in more detail later. But at first glance I have to admit that the number of SMXs seems to be a natural limit on the number of concurrent kernels. That would still mean an improvement by a factor of 2. I have to wait and see what the Tesla cards can do once they get released.


Tim, is there anything in the GK104 hardware that would make it possible to run kernels on just some of the SMXs (or GPCs) and leave the others free for other processes?

I’m thinking of the many powerful uses for that: running display on one SMX while the rest of the GPU runs CUDA (no more watchdog timeouts!). Or partitioning one GPU into smaller ones for server virtualization. Or having a process that runs one small persistent long-lived kernel on one SMX for a guaranteed low latency compute task while the rest of the GPU is still available for other processing like normal (awesome for embedded systems). Or testing how CUDA codes scale across variable numbers of SMXs.

Note I’m not asking about unannounced GK110 features, nor about existing driver support, I’m just asking if the now released GK104 hardware could possibly support such a thing… Fermi cannot.

Nope, same limitations in terms of concurrent processes as Fermi.

Is it going to be easier to launch more than 4 concurrent kernels (cf. the slides from the StreamsAndConcurrencyWebinar.pdf)?

We are using a task-parallel programming model, and a single task can hardly use all of the SMs of the current Fermi Tesla cards.

It is really hard to schedule concurrent kernels efficiently because of the compute engine queue. The limitation that kernels are dispatched to the hardware in the order they are launched, even though the stream parameter indicates that they are independent, is really annoying. One has to make sure that the stream into which one wants to launch the next kernel has completed, as it otherwise blocks the whole device.

But querying the execution of each kernel with a cudaEvent is way too expensive and prohibits concurrent execution; the same seems to be true for cudaStreamQuery.

So the best solution we have come up with so far is to record, for each stream, a timestamp of its latest kernel launch. Before scheduling a new round of kernels we prioritize the streams according to that schedule time, i.e. streams with the oldest schedule time or streams that are known to be complete get the highest priority. I would expect that this could be done more efficiently on the device itself.
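To make the heuristic above concrete, here is a minimal host-side sketch in C++. It is not our actual code: the names `StreamSlot` and `prioritize` are hypothetical, the timestamp is just an integer taken from whatever host clock you use, and the `knownComplete` flag is assumed to be set by the application when it already knows a stream is idle (no CUDA call is made here).

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical host-side bookkeeping for one CUDA stream.
struct StreamSlot {
    int id;                    // index of the stream in the application's stream pool
    std::uint64_t lastLaunch;  // host timestamp of the latest kernel launch into this stream
    bool knownComplete;        // true if the host already knows the stream is idle
};

// Returns stream ids ordered from highest to lowest launch priority:
// streams known to be complete come first, then the oldest launch timestamp.
std::vector<int> prioritize(std::vector<StreamSlot> slots) {
    std::stable_sort(slots.begin(), slots.end(),
        [](const StreamSlot& a, const StreamSlot& b) {
            if (a.knownComplete != b.knownComplete) {
                return a.knownComplete;         // complete streams win
            }
            return a.lastLaunch < b.lastLaunch; // otherwise oldest launch wins
        });

    std::vector<int> order;
    order.reserve(slots.size());
    for (const StreamSlot& s : slots) {
        order.push_back(s.id);
    }
    return order;
}
```

The next round of kernels would then be launched into the streams in the returned order, so that a stream that is most likely still busy is touched last. The ordering is only a guess based on launch age; it does not verify that a stream has actually finished.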

For a lot of applications the cards are getting too big to be utilized by a single kernel. IMHO it’s time to enhance the mechanism for task-parallel programming. So being able to launch >= 16 small independent kernels would really be nice.

GTX 680 is a little better than Fermi in that regard, but not significantly.