Kernel execution blocks CPU code


I've got a problem with kernel launches. All the documentation says they are asynchronous, but for me this doesn't seem to be true:

[codebox]printf("Kernel launch: %d blocks x %d threads, %d bytes shared mem/block\n",
       dimGrid.x, dimBlock.x, s_mem_size);

encode_cb_kernel<<< dimGrid, dimBlock, s_mem_size >>>
    (rawcbs_d, cbs_d, n_cbs, slope_max_d, pic->xSize, mode, enable_pcrd,
     global_buf_d, global_buf_ofs_d);

CUDA_SAFE_CALL(cudaThreadSynchronize()); // wait...[/codebox]


After launching the kernel, execution doesn't return to the CPU! There is a delay of 2 seconds, then "launched" and immediately afterwards "sync'ed" are printed. The same thing happens when debugging: when I "step over" the kernel launch, I can see that the CPU program sleeps for 2 seconds.

My system: WinXP Prof., VC++ 2005, CUDA SDK 2.1, GeForce 8600 GT

Thanks for your help, Martin

Kernel launching really is asynchronous, unless you have queued up a large number of launches and filled the CUDA driver FIFO (whose size seems to vary with OS and CUDA version). I have an application that queues up sequences of up to 10 long-running kernels at a time, with total elapsed launch times of under a millisecond.

Unless you flush stdout after every write, you can't use that pair of printf() calls to determine whether the launch is synchronous or asynchronous, because the output is usually buffered. The only way to be sure is to read a high-precision host timer just before the kernel launch, again straight after the launch, and again after the cudaThreadSynchronize() call.

No, there is definitely only one kernel running, and timing gives the same result: the kernel launch really takes 3 seconds. I tried a version of the program with streaming and a version without.
Could this be an issue of compute capability?

I recall seeing a few things mentioned in the forums that can make kernel launches synchronous. Using the profiler is one. Not sure if compiling in debug mode also does it.

No, we tried everything: Debug mode is off, also the profiler. We even called cudaStreamSynchronize() and cudaThreadSynchronize() before the kernel launch, so definitely no other kernel is running.

Are there any other possible causes of this problem?

Never trust buffered I/O for timing purposes. It could very well be buffering the "launched" line, only to print it later (i.e. together with the "sync'ed" line). I'm not saying with certainty that this is what is going on here, just that it is possible. Precise wall-clock timing functions such as GetSystemTimeAsFileTime (Windows) and gettimeofday (Linux) are much better for testing async launches.

Are you 100% positive? You mention that you are using Visual Studio on Windows. Are you certain that the project run environment doesn't contain CUDA_PROFILE=1? Are you certain that someone has not set CUDA_PROFILE=1 in the system environment? Similarly, have you checked for the "always sync" environment variable setting? (I forget exactly what it is; look it up in the docs.) The most certain way to verify that these are not set is to check for them from within your program.

Does your kernel use any local memory? I have never tried it, but the memory allocation step needed before running a kernel that uses local memory might very well add an implicit sync within the kernel call.

(I'm also working on the project, so I'm answering instead of Martin.)

Yes, we used the right timing methods, but still got the same results.

And yes, 101% positive. Absolutely no profiler is on, and no "always sync" setting either.

And the kernel does use local memory, but why shouldn't the CPU be able to keep working while a memory allocation is done on the GPU? Is there any "official" statement on this?

Have you tried flushing the output stream as Avidday suggested?

printf( …
fflush( stdout );

A newline in the printf does not flush the stream; you have to call fflush() after every printf if you want the output to appear when it happens.

OK, we have finally succeeded in making it asynchronous.

The problem really was the local memory: we replaced it with global memory that is allocated once at program start and then reused. It turned out that cudaMalloc also causes synchronization, but that depends on the size of the allocated block: allocating 1 KB didn't cause synchronization, but allocating 1 MB did.
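A minimal sketch of that allocate-once pattern, with placeholder names (scratch_d, init_scratch, etc. are not from the original code):

```cuda
#include <cuda_runtime.h>

static float *scratch_d = NULL;   /* reused across all kernel launches */

void init_scratch(size_t bytes)
{
    /* Pay the (possibly synchronizing) allocation cost once, at startup. */
    cudaMalloc((void **)&scratch_d, bytes);
}

void encode_frame(/* ... */)
{
    /* Launches now pass the preallocated buffer instead of relying on
       per-launch local-memory allocation, so they return immediately:
       encode_cb_kernel<<<grid, block, smem>>>(..., scratch_d);        */
}

void shutdown_scratch(void)
{
    cudaFree(scratch_d);
    scratch_d = NULL;
}
```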

I can confirm that you cannot rely on printf() alone to check kernel launch time. Even fflush() didn't result in immediate output on the console, probably because the GPU used for CUDA computation was also driving the display.

Thanks for your help, folks!

Congrats on getting this to work.

Hmm yes… I too found this out the hard way in my code after days of work.

Why doesn't Nvidia state this in their programming guide? And why does local memory usage make kernel calls synchronous?