Kernel execution blocks CPU code

Hi,

I've got a problem concerning kernel launches. In all the documentation they are said to be asynchronous, but for me this doesn't seem to be true:

[codebox]printf("Kernel launch: %d blocks x %d threads, %d bytes shared mem/block\n", dimGrid.x, dimBlock.x, s_mem_size);

encode_cb_kernel<<< dimGrid, dimBlock, s_mem_size >>>
	(rawcbs_d, cbs_d, n_cbs, slope_max_d, pic->xSize, mode, enable_pcrd,
	 global_buf_d, global_buf_ofs_d);

printf("launched!\n");

CUDA_SAFE_CALL(cudaThreadSynchronize()); // wait...

printf("sync'ed!\n");[/codebox]

After launching the kernel, program execution doesn't return to the CPU right away! There is a delay of about 2 seconds, then "launched!" and immediately afterwards "sync'ed!" are printed. The same happens when debugging the kernel launch: when I "step over" the launch, I can see that the CPU program sleeps for about 2 seconds.

My system: WinXP Prof., VC++ 2005, CUDA SDK 2.1, GeForce 8600 GT

Thanks for your help, Martin

Kernel launching really is asynchronous, unless you have queued up a large number of kernel launches and filled the CUDA driver FIFO (whose size seems to vary with OS and CUDA version; I have an application that queues up sequences of up to 10 long-running kernel launches at a time with a total elapsed time of under a millisecond).

Unless you flush stdout after every write, I don't think you can use that pair of printf() calls to determine whether the launch is synchronous or asynchronous, because there is usually buffering. The only way to be sure is to read a high-precision host timer just before the kernel launch, then read it again straight after the kernel launch, and once more after the cudaThreadSynchronize() call.
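For what it's worth, here is a minimal sketch of that timing procedure (not Martin's actual code; busy_kernel and the QueryPerformanceCounter-based timer are just illustrative stand-ins for a WinXP setup). If the launch is asynchronous, the first delta should be tiny and the kernel's run time should only show up after cudaThreadSynchronize():

[codebox]#include <stdio.h>
#include <windows.h>
#include <cuda_runtime.h>

// stand-in kernel that just burns some GPU time
__global__ void busy_kernel(int *out, int iters)
{
    int acc = 0;
    for (int i = 0; i < iters; ++i)
        acc += i;
    if (threadIdx.x == 0)
        out[blockIdx.x] = acc;   // keeps the loop from being optimized away
}

static double to_seconds(LARGE_INTEGER t, LARGE_INTEGER freq)
{
    return (double)t.QuadPart / (double)freq.QuadPart;
}

int main()
{
    int *out_d = 0;
    cudaMalloc((void **)&out_d, 64 * sizeof(int));

    LARGE_INTEGER freq, t0, t1, t2;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&t0);
    busy_kernel<<< 64, 128 >>>(out_d, 1 << 22);   // queue the launch
    QueryPerformanceCounter(&t1);                 // read straight after the launch
    cudaThreadSynchronize();                      // wait for the kernel to finish
    QueryPerformanceCounter(&t2);

    printf("return from launch: %.6f s, launch to sync: %.6f s\n",
           to_seconds(t1, freq) - to_seconds(t0, freq),
           to_seconds(t2, freq) - to_seconds(t1, freq));

    cudaFree(out_d);
    return 0;
}[/codebox]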

No, there is definitely only one kernel running. And timing gives the same result: the kernel launch really takes 3 seconds. I tried a version of the program with streaming and a version without streaming.
Could this be an issue of compute capability?

I recall seeing a few things mentioned in the forums that can make kernel launches synchronous. Using the profiler is one. Not sure if compiling in debug mode also does it.

No, we have tried everything: debug mode is off, and so is the profiler. We even called cudaStreamSynchronize() and cudaThreadSynchronize() before the kernel launch, so there is definitely no other kernel running.

Are there any other possible causes of this problem?

Never trust buffered I/O for timing purposes. It could very well be buffering the "launched!" line only to print it out later (i.e. together with the "sync'ed!" line). I'm not saying with certainty that this is what is going on here, just that it is possible. Precise wall-clock timing functions such as GetSystemTimeAsFileTime (Windows) and gettimeofday (Linux) are much better for testing for async launches.

Are you 100% positive? You mention that you are using Visual Studio on Windows. Are you certain that the project run environment doesn't contain CUDA_PROFILE=1? Are you certain that no one has set CUDA_PROFILE=1 in the system environment? Similarly, have you checked for the "always sync" environment variable setting? (I forget exactly what it is; look it up in the docs.) The most certain way to verify that these are not set is to check for them within your program execution.
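A minimal sketch of checking this from inside the program (the names are my assumption: CUDA_PROFILE is the profiler switch mentioned above, and CUDA_LAUNCH_BLOCKING is, I believe, the "always sync" variable, but check the docs):

[codebox]#include <stdio.h>
#include <stdlib.h>

static void show_env(const char *name)
{
    const char *val = getenv(name);
    printf("%s = %s\n", name, val ? val : "(not set)");
}

int main()
{
    show_env("CUDA_PROFILE");          /* profiler forces synchronous launches */
    show_env("CUDA_LAUNCH_BLOCKING");  /* assumed name of the "always sync" switch */
    return 0;
}[/codebox]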

Does your kernel use any local memory? I have never tried it, but the memory allocation step needed before running a kernel using local memory might very well add an implicit sync within the kernel call.

(I'm also working on the project, so I'm answering instead of Martin.)

Yeah, we have used proper timing methods, but we still get the same results.

And yes, 101% positive. Absolutely no profiler is on, and no "always sync" setting either.

And the kernel does use local memory, but why should the CPU not be able to keep working while a memory allocation is being done on the GPU? Is there perhaps any "official" statement on this?

Have you tried flushing the output stream as Avidday suggested?

printf( ... );
fflush( stdout );

A newline in the printf does not flush the stream; you have to flush after every printf if you want the output to show up when it happens.

OK, we have finally succeeded in making it asynchronous.

The problem really was the local memory: we replaced it with global memory that is allocated only once in the program and re-used afterwards. It turned out that cudaMalloc also causes synchronization, but it depends on the size of the allocated block: allocating 1 KB didn't cause synchronization, but allocating 1 MB did.
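For anyone hitting the same thing, this is roughly the shape of the workaround (illustrative names only, not our actual kernel): the scratch buffer lives in global memory, is cudaMalloc'ed once up front, and every launch re-uses it:

[codebox]#include <stdio.h>
#include <cuda_runtime.h>

// stand-in kernel: each thread works in its own slice of a preallocated
// global-memory scratch buffer instead of using local memory
__global__ void work_kernel(float *scratch, int per_thread)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float *my = scratch + idx * per_thread;
    for (int i = 0; i < per_thread; ++i)
        my[i] = (float)i;
}

int main()
{
    const int blocks = 64, threads = 128, per_thread = 256;
    float *scratch_d = 0;

    // allocate once, up front; no cudaMalloc in the launch loop
    cudaMalloc((void **)&scratch_d,
               (size_t)blocks * threads * per_thread * sizeof(float));

    for (int iter = 0; iter < 10; ++iter)
        work_kernel<<< blocks, threads >>>(scratch_d, per_thread);  // launches stay asynchronous

    cudaThreadSynchronize();
    cudaFree(scratch_d);
    return 0;
}[/codebox]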

I can confirm that you cannot rely on printf() alone to check kernel launch timing. Even fflush() didn't result in immediate output on the console, probably because the GPU used for the CUDA computation was also driving the graphics display.

Thanks for your help, folks!

Congrats on getting this to work.

Hmm yes… I too found this out the hard way in my code after days of work.

Why doesn't Nvidia state this in their programming guide? And why does using local memory make kernel calls synchronous?

Thanks