on different parts of the same matrices, where the work split is calculated using an execution timing model so that the two gemm calls finish at almost exactly the same time; in the process, the hostDgemm call effectively hides the memcpy latency. Instrumentation shows that I can reliably predict and match the timings to within +/- 50 ms on calls of about 2 seconds' duration.
This has worked reliably for me since about CUDA 2.1. What OS is this on?
Yes!!! It should work this way, but on my system it does not… And it actually worked for me too some time ago, but after a reinstall it stopped. I migrated to Win7 and use CUDA 3.0; I am now checking for the error with other versions of the SDK.
This is a fun side effect of how the CUDA driver handles WDDM platforms. Once you enqueue a number of calls, a batch will be submitted. The kernel submission time on WDDM is roughly 15x that of Linux/TCC, so we have to batch calls in order to prevent huge performance problems. TCC doesn’t have this problem, and you can record an event after a kernel launch and query it in order to “defeat” the CUDA driver’s batching scheme.
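The workaround mentioned above can be sketched roughly as follows. This is an illustrative host-side fragment, not verified against any particular driver version; the kernel name and launch configuration are placeholders:

```cpp
#include <cuda_runtime.h>

__global__ void myKernel() { /* placeholder kernel */ }

// Record an event right after the launch and query it. The query is
// non-blocking, but on WDDM it nudges the driver into submitting the
// partially filled command batch instead of waiting for it to fill up.
void launchAndFlush(cudaStream_t stream) {
    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
    myKernel<<<1, 1, 0, stream>>>();
    cudaEventRecord(ev, stream);
    cudaEventQuery(ev);   // returns immediately; forces batch submission
    cudaEventDestroy(ev);
}
```

The CPU never blocks here, which is what distinguishes this from simply calling a synchronization function after every launch.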
To answer the other question, CUDA_LAUNCH_BLOCKING is an environment variable that affects run time behavior of a CUDA application. One can easily demonstrate this by running the same application with and without CUDA_LAUNCH_BLOCKING=1. This is a handy mechanism during debugging, although these days I find that I rarely need it anymore.
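For anyone unfamiliar with it, the comparison above amounts to something like this (`./myapp` is a stand-in for your own application):

```shell
# Force every kernel launch to block until completion:
CUDA_LAUNCH_BLOCKING=1 ./myapp

# Normal (asynchronous) launch behavior:
./myapp
```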
Thanks, I already knew about that variable. I understand that my program started to behave as if this variable were set, so I began to think it had been set somewhere by mistake. I would say that the decision to batch calls is strange and not well documented. It can double an application's run time! The application should be able to choose whether calls are batched or not.
Every time I say to myself “I should really do more Windows stuff and play with NSight” I remember the perils of WDDM!
Is there a way to manually tell CUDA to submit a partially filled queue of calls? Or maybe a simple hack like sending 15 EventRecord()s to fill up the queue with effective no-ops and force it to submit?
OK, now I’m really confused. Maybe I don’t understand what YOU mean by a kernel launch and batching.
Are you saying that, in the very common case of a program having only one kernel, that kernel doesn't get launched until a synchronization point or a read from the device? If that were true, then launches would not be asynchronous as documented. What does batching mean when you have only one kernel?
Sorry, but I also don’t understand your workaround.
“you can record an event after a kernel launch and query it in order to “defeat” the CUDA driver’s batching scheme.”
“Are you saying that, in the very common case of a program having only one kernel, that kernel doesn't get launched until a synchronization point or a read from the device? If that were true, then launches would not be asynchronous as documented. What does batching mean when you have only one kernel?”
Yes, it does not. The batch is not full, and it will actually be sent to the device with the next synchronization call, such as a memcpy or a thread synchronize.
Well then, that is not what I would call an asynchronous kernel launch. The documentation on Asynchronous Concurrent Execution should specifically call out this issue for Windows users. We're not all that rare :)
An old thread, but I had to second this. I just wasted a couple of days discovering this completely undocumented “feature.” Please update the documentation. Has it really been years since this was implemented, with no documentation?