GPU and CPU don't run in (pure) parallel ?

Disappointing not to hear anything about how and when :( I read elsewhere that next month’s release is 1.0 - is that true?
I will have to nice off the runtime in production and put up with the extra latency till it is fixed. From reading elsewhere, kernel launch latency is already anything from 10-20ms - where is that going?
The hardware is brilliant and the tools are very well done, but the driver/host interface is clunky. Have you considered open sourcing this bit and driving it yourselves (rather than leaving it to the usual inferior competing open source driver)? It is obviously not core for Nvidia.
Eric

Don’t worry, we hear you! :)

The 1.0 release will have non-blocking kernel launches. This means that the CPU will not spin-wait for the kernel to finish. I believe this is what you are asking for.

This is already in and we are testing it.
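To illustrate what non-blocking launches buy you, here is a minimal sketch. The API names are my assumptions based on the 1.0-era runtime (cudaThreadSynchronize() for the sync call), and the kernel and CPU work are hypothetical placeholders:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: doubles an array in place.
__global__ void myKernel(float *data) { data[threadIdx.x] *= 2.0f; }

void doSomeCpuWork(void) { /* independent CPU-side work */ }

int main(void)
{
    float *d_data;
    cudaMalloc((void **)&d_data, 256 * sizeof(float));

    // With non-blocking launches, this returns as soon as the kernel
    // is queued, not when it finishes.
    myKernel<<<1, 256>>>(d_data);

    doSomeCpuWork();          // CPU and GPU now run in parallel

    // Block until the GPU is done before using the results
    // (cudaThreadSynchronize() is the assumed 1.0-era sync call).
    cudaThreadSynchronize();

    cudaFree(d_data);
    return 0;
}
```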

There are some things developers will have to be aware of:

  1. You won’t be able to directly time kernel launches anymore, since they will return immediately. If you are including transfers to/from the device in your timing, then your results will be fine, but otherwise they will be artificially fast. :)
  2. You may not get a correct indication of where errors occurred. An error from one kernel might not be reported until after several more kernels have launched.
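Both caveats show up in code like the following sketch. cudaThreadSynchronize() and cudaGetLastError() are my guesses at the 1.0-era runtime names, and cpuTimer() is a hypothetical host-side timer:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void myKernel(float *data) { data[threadIdx.x] += 1.0f; }

double cpuTimer(void);  // hypothetical wall-clock timer on the host

int main(void)
{
    float *d_data;
    cudaMalloc((void **)&d_data, 256 * sizeof(float));

    double t0 = cpuTimer();
    myKernel<<<1, 256>>>(d_data);
    double t1 = cpuTimer();
    // Caveat 1: t1 - t0 measures only the (tiny) launch cost,
    // not the kernel's execution time.

    cudaThreadSynchronize();   // wait for the kernel to actually finish
    double t2 = cpuTimer();
    // t2 - t0 is the real launch + execution time.

    // Caveat 2: errors surface asynchronously, so only check them
    // after a sync point (cudaGetLastError() is an assumption here).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return 0;
}
```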

We will have mechanisms to deal with these issues. In CUDA 1.0 we will provide a CPU-GPU sync function (similar to glFinish()), but I stress that this should only be used during development and avoided in production code, because synchronizing the CPU and GPU can greatly limit performance in real-world apps.

We may also eventually provide other ways to force launches to block (such as a debug mode), but we are still working on the design of this so it will not be in CUDA 1.0.

Thanks,
Mark

This is very good news, Mark. Will 1.0 also provide async mem transfers?

Peter

In 1.0 there will be an environment variable to re-enable the blocking mode. It is very convenient for timing and debugging.
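In shipping CUDA releases the variable is CUDA_LAUNCH_BLOCKING; whether 1.0 uses that exact name is my assumption. Setting it for a single run looks like this (./my_app stands in for your CUDA program):

```shell
# Force every kernel launch to block until the kernel completes:
#   CUDA_LAUNCH_BLOCKING=1 ./my_app
# The per-command form exports the variable only to that one process:
CUDA_LAUNCH_BLOCKING=1 sh -c 'echo "blocking=$CUDA_LAUNCH_BLOCKING"'
echo "outside=${CUDA_LAUNCH_BLOCKING:-unset}"
```

The per-command form is handy for timing runs, since it leaves normal (non-blocking) behavior untouched for everything else.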

No, not yet. This is much more difficult to implement, and the CUDA team decided they would rather get CUDA 1.0 out with all its other improvements than delay it for this feature.

Mark