Concurrent Kernels Bug / Undocumented Behavior (Urgent) need info on "simple" problem with c

shadowsill · June 17, 2010, 10:42pm

At least read the bold, please!!

(This problem is a lot shorter and simpler than my lengthy post will make it look. Please skim it and see if you can help me out!)

I can’t seem to get concurrent kernel execution working right. I’m using CUDA 3.0 and a GTX480 (which supports concurrent kernels and is compute capability 2.0, for those unfamiliar). For emphasis, I am trying to get kernels to run concurrently, not trying to get data transfers to overlap with kernel executions.

The sdk example program “concurrentKernels” runs correctly.

But if we modify it just slightly, ie paste this over a good chunk of their existing code →

unsigned int timer;

cutCreateTimer(&timer);

cutStartTimer();

for(int i = 0; i < 10; i ++ )

{

	cutResetTimer(timer);

	mykernel<<<nblocks,nthreads,0,stream[0]>>>(&d_A[0*nthreads],n);

	mykernel<<<nblocks,nthreads,0,stream[1]>>>(&d_A[1*nthreads],n);

	cudaThreadSynchronize();

	printf("%f\n", cutGetTimerValue(timer) );

}

cutDeleteTimer(timer);

getc(stdin);

return;

You can see from the times that only SOME of the kernels get executed concurrently, most are serialized. On my system, usually the first of the ten sets of launches will run in parallel, and the rest will be serial.

(execution times on my machine are 40, 80, 80, 80, 80, 80, 80, 80, 80, 80 ms). If I mess around with things enough, sometimes I can get one or two more sets to execute concurrently, never more than 3 of the 10.

BUT here’s the kicker. If we add a cudaEventRecord call back in before the launches:

unsigned int timer;

cutCreateTimer(&timer);

cutStartTimer();

for(int i = 0; i < 10; i ++ )

{

	cutResetTimer(timer);

	// V only line added!!!

	cudaEventRecord(start,0);

	mykernel<<<nblocks,nthreads,0,stream[0]>>>(&d_A[0*nthreads],n);

	mykernel<<<nblocks,nthreads,0,stream[1]>>>(&d_A[1*nthreads],n);

	cudaThreadSynchronize();

	printf("%f\n", cutGetTimerValue(timer) );

}

cutDeleteTimer(timer);

getc(stdin);

return;

suddenly everything runs in parallel like it should. Which makes absolutely no sense because the cudaEventRecord call should not be changing anything.

(execution times on my machine are 40, 40, 40, 40, 40, 40, 40, 40, 40, 40 ms)

Some other observations and notes on the situation:

- Information on concurrent kernel execution is extremely lacking.  It's hinted at several times in various manuals, but I only see it explicitly mentioned once.

- The only place I've seen them explicitly mention the possibility of launching DIFFERENT concurrent kernels is in the Fermi whitepaper, and, let's face it, a lot has changed since then.

- Running cudaprof on the unaltered concurrentKernels sdk example shows all the kernels serialized.  Does cudaprof somehow force this to help it in profiling?

- It seems like it's entirely up to the hardware to decide if it's going to be gracious enough to ACTUALLY run the kernels in parallel.  The only things I've ever seen hinting to this fact is in one of the programming guides, "Kernels that use many textures or a large amount of local memory are less likely to execute concurrently with other kernels."  Of course, the simple kernel from the concurrentKernels example does not fall into this category, and it's launched with so few blocks that there's no way the GPU is nearing saturation, but it still gets serialized outside of the carefully controlled situation in the SDK.

- I've searched google and the forums.  Google results are too introductory, forum results are overwhelmed with too many questions that can be answered with a simple 'RTFM'.  So I really need some help here.

So my questions are:

- <b>Why won't this work?</b>

- Does anyone know of a detailed reference on running concurrent kernels?

- <b>Does anyone know how the GPU/driver/whoever schedules asynchronous kernel launches?</b>  (round robin, etc?) - I need some guarantee of execution times, if the GPU is going to haphazardly decide what to run concurrently, I need to find a different solution.

- Is there ANY way to try to "force" concurrent kernels?

- Why isn't there more information on this?  Certain algorithms simply can't completely occupy the entire GPU by themselves, isn't concurrent kernel execution a HUGE part of acheiving good performance for a lot of people?

tmurray · June 17, 2010, 11:21pm

What OS and driver are you using?

shadowsill · June 18, 2010, 12:49pm

Thanks for your response - sorry about that, I tried to describe the problem thoroughly but forgot the most basic information:

- OS is windows 7 64 bit

- SDK and toolkit are 3.0 (CUDA driver is same)

- graphics driver is 197.41

Topic		Replies	Views
Concurrent Kernel using GTX 570 on WinXp Concurrent Kernel CUDA Programming and Performance	0	673	October 10, 2011
Concurrent kernel execution CUDA Programming and Performance	2	335	March 26, 2024
Overlapping GPU and CPU computation? CUDA Programming and Performance	9	1245	November 19, 2010
Concurrent Kernel execution overhead CUDA Programming and Performance	4	925	March 20, 2011
Problematic multi GPU execution CUDA Programming and Performance	6	1986	June 12, 2012
My streams are not running concurrently CUDA Programming and Performance	7	1777	March 6, 2018
Why Different Kernels in Different Streams Behave Nearly Serially While Same Kernels Overlap Perfectly? CUDA Programming and Performance cuda , kernel	6	39	March 16, 2025
Concurrent Kernel Execution on Fermi - confussion CUDA Programming and Performance	13	1649	October 10, 2011
Why kernel executions in different streams are not parallel? CUDA Programming and Performance	4	2664	April 29, 2019
Is it possible to execute two kernels concurrently? CUDA Programming and Performance	18	6639	July 2, 2010

Concurrent Kernels Bug / Undocumented Behavior (Urgent) need info on "simple" problem with c

Related topics