Problem: CUDA calls are synchronized

I have a problem: I found that CUDA kernels only start executing after cudaThreadSynchronize() calls or other non-asynchronous function calls.

i.e.

kernel<<<…>>>();

do_something(); // 5 minutes of CPU work

cudaThreadSynchronize(); // the kernel only starts to run here

So the total time is 5 minutes + kernel time.

I am now testing with different SDK versions, etc. Is anybody familiar with this kind of error? Thanks.

You must be doing something wrong, whether you realize it or not. Kernel launches really are asynchronous. I have linear algebra libraries that do something like:

cudaMemcpy(htod)

cublasDgemm()

hostDgemm()

cudaMemcpy(dtoh)

on different parts of the same matrices, where the work split is calculated using an execution timing model so that the two gemm calls finish at exactly the same time; in the process, the hostDgemm call effectively hides the memcpy latency. Instrumentation shows that I can reliably predict and match the timings to within +/- 50ms on calls of about 2 seconds duration.

This has worked reliably for me since about CUDA 2.1. What OS is this on?

Yes!!! It should work this way, but on my system it does not… It actually worked for me too some time ago, but after a reinstall it stopped. I migrated to Win7 and use CUDA 3.0; I am now checking for the error with other SDK versions.

I tested with the latest SDK 3.2 and drivers 260.99 on Win7 64-bit, and launches are still synchronized.

By the way, does anybody know: if the environment variable CUDA_LAUNCH_BLOCKING is set to 1, does it disable asynchronous launches at compile time or at run time for the application?

I wonder, could this be introduced at the compilation stage, or is it a run-time system issue?

This is a fun side effect of how the CUDA driver handles WDDM platforms. Once you enqueue a number of calls, a batch will be submitted. The kernel submission time on WDDM is roughly 15x that of Linux/TCC, so we have to batch calls in order to prevent huge performance problems. TCC doesn’t have this problem, and you can record an event after a kernel launch and query it in order to “defeat” the CUDA driver’s batching scheme.
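A minimal sketch of that event-query workaround; the kernel name and launch configuration here are placeholders, not from the thread:

```cuda
cudaEvent_t ev;
cudaEventCreate(&ev);

my_kernel<<<grid, block>>>(/* args */); // enqueued into the WDDM batch
cudaEventRecord(ev, 0);                 // record an event right after the launch
cudaEventQuery(ev);                     // non-blocking query; nudges the driver
                                        //   into flushing the pending batch

// ... long CPU work here now overlaps with the kernel ...

cudaEventDestroy(ev);
```

cudaEventQuery() returns immediately (cudaSuccess or cudaErrorNotReady), so unlike cudaThreadSynchronize() it does not stall the CPU thread.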

Great, thanks for the explanation!!! I downloaded and installed a lot of SDKs. Is this true for Win7 64-bit and for Win7 32-bit too?

How do I work around it on Win7 32/64-bit?

What is the batch size? I run a relatively long CUDA kernel and start some work afterward. I am sure its time is much larger than the launch time.

Can you please tell me the launch time on Win7 and WinXP with the latest SDK?

I think I need to launch a few empty kernels just to fill the batch.

To answer the other question, CUDA_LAUNCH_BLOCKING is an environment variable that affects run time behavior of a CUDA application. One can easily demonstrate this by running the same application with and without CUDA_LAUNCH_BLOCKING=1. This is a handy mechanism during debugging, although these days I find that I rarely need it anymore.
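Since the variable is read at run time, the same binary can be run either way without recompiling; a hypothetical invocation (`./my_app` is a placeholder name):

```shell
# normal run: kernel launches return immediately
./my_app

# same binary: every launch now blocks until the kernel completes
CUDA_LAUNCH_BLOCKING=1 ./my_app
```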

Thanks, I already knew about that variable. My program started to behave as if this variable were set, so I began to think it had been set somewhere by mistake. I have to say the decision to batch calls is strange and not documented well. It can double application time! The application should be able to choose whether to batch or not.

Every time I say to myself “I should really do more Windows stuff and play with NSight” I remember the perils of WDDM!

Is there a way to manually tell CUDA to submit a partially filled queue of calls? Or maybe a simple hack like sending 15 EventRecord()s to fill up the queue with effective noops and force it to submit?

“The kernel submission time on WDDM is roughly 15x that of Linux/TCC”

As a windows only user, I’ll have to take your word for that.
However, that is obviously NOT the problem. He says he isn’t getting a kernel launch for 5 MINUTES.

My (Windows 7) launches are immediate. Maybe a 15-times-slower kind of immediate than Linux :)

Yeah, launch times on Linux are measured in tens of microseconds, so this effect is only an efficiency hit for really short kernels.

OK, I got a 30% application speedup. By the way, my actual kernel time is about 30 milliseconds.

Read what I said again. Kernel launches are batched until some number of calls are made OR the results are requested via some synchronization point. If you do

kernel<<<…>>>();

cudaMemcpy(); // partially empty batch actually gets flushed here

no problem. If you do

kernel<<<…>>>(); // only kernel call

a_giant_function_on_the_cpu(); // five minute function

cudaMemcpy(); // partially empty batch actually gets flushed here

The WAR I posted handles that.
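Applied to that second pattern, the workaround would look something like this (a sketch; the event name is a placeholder):

```cuda
cudaEvent_t flush;
cudaEventCreate(&flush);

kernel<<<…>>>();               // only kernel call
cudaEventRecord(flush, 0);
cudaEventQuery(flush);         // forces the partially filled batch out now

a_giant_function_on_the_cpu(); // five-minute function; kernel runs in parallel

cudaMemcpy();                  // by now the kernel has long since finished

cudaEventDestroy(flush);
```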

The problem is that cudaMemcpy does not return until the copy has completed.

OK, now I’m really confused. Maybe I don’t understand what YOU mean by a kernel launch and batching.

Are you saying that in the very common case of a program having only one kernel, that kernel doesn't get launched until a synchronization point or a read from the device? If that were true, then launches are not asynchronous as documented. What does batching mean when you have only one kernel?

Sorry, but I also don’t understand your workaround.

“you can record an event after a kernel launch and query it in order to “defeat” the CUDA driver’s batching scheme.”

Can you give an example?

“Are you saying that in the very common case of a program having only one kernel, that that kernel doesn’t get launched until a synchronization point or a read from the device. If that were true, then launches are not asynchronous as documented. What does batching mean when you have only one kernel?”

Yes, it does not launch. The batch is not full, and it will actually be sent to the device with the next synchronizing call, such as cudaMemcpy or cudaThreadSynchronize.

Well then, that is not what I would call an asynchronous kernel launch. The documentation on Asynchronous Concurrent Execution should specifically state this issue for Windows users. We’re not all that rare :)

An old thread, but I had to second this. Just wasted a couple of days finding this completely undocumented “feature.” Please update the documentation. Has it actually been years since this was implemented without any documentation?