Increased number of concurrent kernels for kepler? How many concurrent kernels can a kepler card lau

plotzi76 · March 26, 2012, 9:54pm

Hi,

Does anybody know whether the tripled CUDA core count of the Kepler cards (formerly 512, now 1536) means that we can launch three times as many concurrent kernels? The Fermi cards can schedule up to 16 kernels concurrently, though I could hardly measure a performance gain above 4 concurrent kernels. Being able to launch 48 (or even 12) independent compute kernels would just be awesome.

Cheers

parallelis · March 28, 2012, 6:58pm

Kepler, or more precisely the GK104 of the GTX 680, is organized into 8 SMX units with 192 CUDA cores each. From this point of view you may be able to launch up to 8 kernels simultaneously, but these 8 SMX units are grouped in pairs into what NVIDIA calls a GPC (so 4 GPCs on GK104). I suspect the GPC is the base unit for kernel execution, which would leave only 4 slots for executing kernels concurrently on the GK104.

Read this NVIDIA PDF about the GTX 680 Kepler architecture; it’s really interesting.
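The core-count arithmetic behind this post and the original question can be written down as a short compile-time check. This is only a sketch: the GK104 figures are the ones quoted above, while the Fermi breakdown (16 SMs with 32 cores each for the 512-core part) is an added assumption that does not appear in the thread, and all constant names are made up for illustration.

```cpp
// Core-count arithmetic for the two chips discussed in the thread.
// GK104 figures come from the post above; the Fermi split into
// 16 SMs x 32 cores is an assumption added for this sketch.
constexpr int kFermiSms = 16;
constexpr int kFermiCoresPerSm = 32;
constexpr int kFermiCores = kFermiSms * kFermiCoresPerSm;    // 512

constexpr int kGk104Smx = 8;
constexpr int kGk104CoresPerSmx = 192;
constexpr int kGk104Cores = kGk104Smx * kGk104CoresPerSmx;   // 1536

constexpr int kGk104SmxPerGpc = 2;
constexpr int kGk104Gpcs = kGk104Smx / kGk104SmxPerGpc;      // 4

// The core count triples, but the number of multiprocessors halves.
static_assert(kFermiCores == 512, "Fermi core count from the question");
static_assert(kGk104Cores == 3 * kFermiCores, "tripled core count");
static_assert(kGk104Smx * 2 == kFermiSms, "half as many multiprocessors");
static_assert(kGk104Gpcs == 4, "four GPCs on GK104");
```

Under these numbers the tripled core count comes from wider multiprocessors rather than more of them, so the core count by itself does not say how many kernels can run concurrently.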

tmurray · March 28, 2012, 9:18pm

nah, I think it’s still 16. there’s no one-kernel-per-SM limitation (never has been).

plotzi76 · March 28, 2012, 9:25pm

Thanks for the excellent resources on the new architecture. I will read them in more detail later. At first glance, though, I have to admit that the number of SMX units seems to be a natural limit on the number of concurrent kernels. That would still mean an improvement by a factor of 2. I will have to wait and see what the Tesla cards can do once they are released.

Cheers

SPWorley · March 29, 2012, 5:24am

Tim, is there anything in the GK104 hardware that would make it possible to run kernels on just some of the SMXs (or GPCs) and leave the others free for other processes?

I’m thinking of the many powerful uses for that: running display on one SMX while the rest of the GPU runs CUDA (no more watchdog timeouts!). Or partitioning one GPU into smaller ones for server virtualization. Or having a process that runs one small persistent long-lived kernel on one SMX for a guaranteed low latency compute task while the rest of the GPU is still available for other processing like normal (awesome for embedded systems). Or testing how CUDA codes scale across variable numbers of SMXs.

Note I’m not asking about unannounced GK110 features, nor about existing driver support, I’m just asking if the now released GK104 hardware could possibly support such a thing… Fermi cannot.

tmurray · March 29, 2012, 4:07pm

nope, same limitations in terms of concurrent processes as Fermi.

plotzi76 · March 30, 2012, 9:41am

Is it going to be easier to launch more than 4 concurrent kernels (cf. the slides from the StreamsAndConcurrencyWebinar.pdf)?

We are using a task-parallel programming model, and a single task can hardly use all of the SMs of the current Fermi Tesla cards.

It is really hard to schedule concurrent kernels efficiently because of the compute engine queue. The limitation that kernels are dispatched to the hardware in the order they are launched, even though the stream parameter indicates that they are independent, is really annoying. One has to make sure that the stream in which one wants to launch the next kernel is complete, as it otherwise blocks the whole device.

But querying the execution of each kernel with a cudaEvent is way too expensive and prohibits concurrent execution; the same seems to be true for cudaStreamQuery.
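The head-of-line blocking described above can be illustrated with a small host-side model. This is a simplified sketch and not a description of the actual hardware scheduler: it assumes a single in-order queue, that a kernel may start only after the previous kernel of its own stream has finished, and that nothing behind a blocked queue head can be dispatched. `Launch`, `simulateInOrderQueue`, `depthFirst` and `breadthFirst` are hypothetical names, and durations are in arbitrary time units.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

struct Launch {
    int stream;       // hypothetical stream id in [0, numStreams)
    double duration;  // run time in arbitrary units
};

// Simplified model of one in-order dispatch queue. Returns the time at
// which the last kernel finishes (the makespan).
double simulateInOrderQueue(const std::vector<Launch>& queue,
                            int numStreams, int maxConcurrent) {
    assert(numStreams > 0 && maxConcurrent > 0);
    std::vector<double> streamReady(numStreams, 0.0);  // end of last kernel per stream
    std::vector<double> endTimes;                      // end times of dispatched kernels
    double headTime = 0.0;  // dispatch time of the previous queue head
    double makespan = 0.0;
    for (const Launch& k : queue) {
        assert(k.stream >= 0 && k.stream < numStreams);
        // In-order dispatch: never before the previous head, and never
        // before the predecessor in the same stream has finished.
        double start = std::max(headTime, streamReady[k.stream]);
        // Concurrency limit: wait until fewer than maxConcurrent run.
        if (endTimes.size() >= static_cast<std::size_t>(maxConcurrent)) {
            std::vector<double> sorted(endTimes);
            std::sort(sorted.begin(), sorted.end(), std::greater<double>());
            start = std::max(start, sorted[maxConcurrent - 1]);
        }
        const double end = start + k.duration;
        headTime = start;
        streamReady[k.stream] = end;
        endTimes.push_back(end);
        makespan = std::max(makespan, end);
    }
    return makespan;
}

// All kernels of stream 0 first, then all kernels of stream 1, and so on.
std::vector<Launch> depthFirst(int numStreams, int perStream, double duration) {
    std::vector<Launch> q;
    for (int s = 0; s < numStreams; ++s)
        for (int i = 0; i < perStream; ++i) q.push_back({s, duration});
    return q;
}

// One kernel per stream in round-robin order.
std::vector<Launch> breadthFirst(int numStreams, int perStream, double duration) {
    std::vector<Launch> q;
    for (int i = 0; i < perStream; ++i)
        for (int s = 0; s < numStreams; ++s) q.push_back({s, duration});
    return q;
}
```

With two streams of two unit-length kernels each and a limit of 16 concurrent kernels, this model finishes at time 3 for the depth-first order and at time 2 for the breadth-first order, which is why the launch order matters even when the streams are independent.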

So the best solution we have come up with so far is to record a timestamp for each stream, holding the time of its latest kernel launch. Before scheduling a new round of kernels we prioritize the streams according to that schedule time, i.e. streams with the oldest schedule time, or streams that are known to be complete, get the highest priority. I would expect that this could be done more efficiently on the device itself.
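A minimal host-side sketch of such a prioritization might look as follows. It is one possible reading of the heuristic, not the poster's actual code: streams known to be complete are placed first, and within each group the stream with the oldest launch timestamp comes first. `StreamState` and `prioritizeStreams` are hypothetical names, the stream is represented by a plain integer id instead of a `cudaStream_t`, and how the completion flag and timestamp are obtained is left outside the sketch.

```cpp
#include <algorithm>
#include <vector>

// Host-side bookkeeping for one stream (all fields are illustrative).
struct StreamState {
    int id;                 // stands in for a cudaStream_t handle
    bool knownComplete;     // true if the stream is known to be idle
    double lastLaunchTime;  // host timestamp of the latest kernel launch
};

// Returns stream ids in the order in which the next round of kernels
// should be launched: known-complete streams first, then by oldest
// launch time. Ties keep their original relative order.
std::vector<int> prioritizeStreams(std::vector<StreamState> streams) {
    std::stable_sort(streams.begin(), streams.end(),
                     [](const StreamState& a, const StreamState& b) {
                         if (a.knownComplete != b.knownComplete)
                             return a.knownComplete;
                         return a.lastLaunchTime < b.lastLaunchTime;
                     });
    std::vector<int> order;
    order.reserve(streams.size());
    for (const StreamState& s : streams) order.push_back(s.id);
    return order;
}
```

Launching the next round in this order puts the kernels least likely to stall at the head of the dispatch queue, at the cost of relying on host timestamps that only approximate what the device is doing.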

For a lot of applications the cards are getting too big to be fully utilized by a single kernel. IMHO it's time to enhance the mechanism for task-parallel programming. So being able to launch >= 16 small independent kernels would really be nice.

tmurray · March 30, 2012, 9:58pm

GTX 680 is a little better than Fermi in that regard, but not significantly.
