on different parts of the same matrices, where the work split is calculated using an execution timing model so that the two gemm calls finish at almost exactly the same time; in the process, the hostDgemm call effectively hides the memcpy latency. Instrumentation shows that I can reliably predict and match the timings to within +/- 50 ms on calls of about 2 seconds' duration.
This has worked reliably for me since about CUDA 2.1. What OS is this on?
Yes!!! It should work this way, but on my system it does not… And it actually worked for me too some time ago, but after a reinstall it stopped. I migrated to Win7 and use CUDA 3.0; I am now checking for the error with other versions of the SDK.
This is a fun side effect of how the CUDA driver handles WDDM platforms. Once you enqueue a number of calls, a batch will be submitted. The kernel submission time on WDDM is roughly 15x that of Linux/TCC, so we have to batch calls in order to prevent huge performance problems. TCC doesn’t have this problem, and you can record an event after a kernel launch and query it in order to “defeat” the CUDA driver’s batching scheme.
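The workaround mentioned above can be sketched roughly as follows. This is an illustrative host-side fragment, not verified against any particular driver version; the kernel name and launch configuration are placeholders:

```cpp
#include <cuda_runtime.h>

__global__ void myKernel() { /* placeholder kernel */ }

// Record an event right after the launch and query it. The query is
// non-blocking, but on WDDM it nudges the driver into submitting the
// partially filled command batch instead of waiting for it to fill up.
void launchAndFlush(cudaStream_t stream) {
    cudaEvent_t ev;
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
    myKernel<<<1, 1, 0, stream>>>();
    cudaEventRecord(ev, stream);
    cudaEventQuery(ev);   // returns immediately; forces batch submission
    cudaEventDestroy(ev);
}
```

The CPU never blocks here, which is what distinguishes this from simply calling a synchronization function after every launch.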
To answer the other question, CUDA_LAUNCH_BLOCKING is an environment variable that affects run time behavior of a CUDA application. One can easily demonstrate this by running the same application with and without CUDA_LAUNCH_BLOCKING=1. This is a handy mechanism during debugging, although these days I find that I rarely need it anymore.
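For anyone unfamiliar with it, the comparison above amounts to something like this (`./myapp` is a stand-in for your own application):

```shell
# Force every kernel launch to block until completion:
CUDA_LAUNCH_BLOCKING=1 ./myapp

# Normal (asynchronous) launch behavior:
./myapp
```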
Thanks, I already knew about that variable. I understand that my program started to behave as if this variable were set, so I began to think it had been set somewhere by mistake. I would say that the decision to batch calls is strange and not well documented. It can double an application's run time! The application should be able to choose whether calls are batched or not.
Every time I say to myself “I should really do more Windows stuff and play with NSight” I remember the perils of WDDM!
Is there a way to manually tell CUDA to submit a partially filled queue of calls? Or maybe a simple hack like sending 15 EventRecord()s to fill up the queue with effective no-ops and force it to submit?
OK, now I’m really confused. Maybe I don’t understand what YOU mean by a kernel launch and batching.
Are you saying that, in the very common case of a program having only one kernel, that kernel doesn't get launched until a synchronization point or a read from the device? If that were true, then launches would not be asynchronous as documented. What does batching mean when you have only one kernel?
Sorry, but I also don’t understand your workaround.
“you can record an event after a kernel launch and query it in order to “defeat” the CUDA driver’s batching scheme.”
“Are you saying that, in the very common case of a program having only one kernel, that kernel doesn't get launched until a synchronization point or a read from the device? If that were true, then launches would not be asynchronous as documented. What does batching mean when you have only one kernel?”
Yes, it does not. The batch is not full, and it will actually be sent to the device with the next synchronization call, such as a memcpy or a thread synchronize.
Well then, that is not what I would call an asynchronous kernel launch. The documentation on Asynchronous Concurrent Execution should specifically call out this issue for Windows users. We're not all that rare :)
An old thread, but I had to second this. I just wasted a couple of days discovering this completely undocumented “feature.” Please update the documentation. Has it really been years since this was implemented, with no documentation?