Concurrent Kernels Bug / Undocumented Behavior (Urgent) need info on "simple" problem with c

At least read the bold, please!!

(This problem is a lot shorter and simpler than my lengthy post will make it look. Please skim it and see if you can help me out!)

I can’t seem to get concurrent kernel execution working right. I’m using CUDA 3.0 and a GTX480 (which supports concurrent kernels and is compute capability 2.0, for those unfamiliar). For emphasis, I am trying to get kernels to run concurrently, not trying to get data transfers to overlap with kernel executions.

The sdk example program “concurrentKernels” runs correctly.

But if we modify it just slightly, ie paste this over a good chunk of their existing code ->

unsigned int timer;

cutCreateTimer(&timer);

cutStartTimer();

for(int i = 0; i < 10; i ++ )

{

	cutResetTimer(timer);

	mykernel<<<nblocks,nthreads,0,stream[0]>>>(&d_A[0*nthreads],n);

	mykernel<<<nblocks,nthreads,0,stream[1]>>>(&d_A[1*nthreads],n);

	cudaThreadSynchronize();

	printf("%f\n", cutGetTimerValue(timer) );

}

cutDeleteTimer(timer);

getc(stdin);

return;

You can see from the times that only SOME of the kernels get executed concurrently, most are serialized. On my system, usually the first of the ten sets of launches will run in parallel, and the rest will be serial.

(execution times on my machine are 40, 80, 80, 80, 80, 80, 80, 80, 80, 80 ms). If I mess around with things enough, sometimes I can get one or two more sets to execute concurrently, never more than 3 of the 10.

BUT here’s the kicker. If we add a cudaEventRecord call back in before the launches:

unsigned int timer;

cutCreateTimer(&timer);

cutStartTimer();

for(int i = 0; i < 10; i ++ )

{

	cutResetTimer(timer);

	// V only line added!!!

	cudaEventRecord(start,0);

	mykernel<<<nblocks,nthreads,0,stream[0]>>>(&d_A[0*nthreads],n);

	mykernel<<<nblocks,nthreads,0,stream[1]>>>(&d_A[1*nthreads],n);

	cudaThreadSynchronize();

	printf("%f\n", cutGetTimerValue(timer) );

}

cutDeleteTimer(timer);

getc(stdin);

return;

suddenly everything runs in parallel like it should. Which makes absolutely no sense because the cudaEventRecord call should not be changing anything.

(execution times on my machine are 40, 40, 40, 40, 40, 40, 40, 40, 40, 40 ms)

Some other observations and notes on the situation:

- Information on concurrent kernel execution is extremely lacking.  It's hinted at several times in various manuals, but I only see it explicitly mentioned once.

- The only place I've seen them explicitly mention the possibility of launching DIFFERENT concurrent kernels is in the Fermi whitepaper, and, let's face it, a lot has changed since then.

- Running cudaprof on the unaltered concurrentKernels sdk example shows all the kernels serialized.  Does cudaprof somehow force this to help it in profiling?

- It seems like it's entirely up to the hardware to decide if it's going to be gracious enough to ACTUALLY run the kernels in parallel.  The only things I've ever seen hinting to this fact is in one of the programming guides, "Kernels that use many textures or a large amount of local memory are less likely to execute concurrently with other kernels."  Of course, the simple kernel from the concurrentKernels example does not fall into this category, and it's launched with so few blocks that there's no way the GPU is nearing saturation, but it still gets serialized outside of the carefully controlled situation in the SDK.

- I've searched google and the forums.  Google results are too introductory, forum results are overwhelmed with too many questions that can be answered with a simple 'RTFM'.  So I really need some help here.

So my questions are:

- <b>Why won't this work?</b>

- Does anyone know of a detailed reference on running concurrent kernels?

- <b>Does anyone know how the GPU/driver/whoever schedules asynchronous kernel launches?</b>  (round robin, etc?) - I need some guarantee of execution times, if the GPU is going to haphazardly decide what to run concurrently, I need to find a different solution.

- Is there ANY way to try to "force" concurrent kernels?

- Why isn't there more information on this?  Certain algorithms simply can't completely occupy the entire GPU by themselves, isn't concurrent kernel execution a HUGE part of acheiving good performance for a lot of people?

What OS and driver are you using?

Thanks for your response - sorry about that, I tried to describe the problem thoroughly but forgot the most basic information:

- OS is windows 7 64 bit

- SDK and toolkit are 3.0 (CUDA driver is same)

- graphics driver is 197.41