CUDA async tutorial? How to use streams, events and so on


For a number of reasons I’m stuck with CUDA 1.1, so no automatic async kernel execution.

I’d like to parallelize GPU and CPU work (not just mem copying); it seems that one can use streams and events (at least, that’s what I infer from the SDK “AsyncAPI” example), but the Programming Guide is really brief on the subject.

Is there a tutorial somewhere?

Thanks a lot,


It’s not the answer to your question, but I’m curious–why are you stuck with 1.1?

I’m in the final rush with a project and can’t afford any surprises (runtime problems, behavior changes, performance hits and so on).

CUDA 2.0beta had strange problems on my reference machine (many segfaults with SDK examples); I chose to take no risks and stick with 1.1 for this project, since there are no clear-cut performance gains with 2.0 (I mostly use CUDA FFT, by the way).


AFAIK kernel execution has always been asynchronous.

Now, other asynchronous calls might be another story.
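
To illustrate the point about kernel launches, here is a minimal sketch. The `busy` kernel and the CPU loop are made-up placeholders, and the final wait is written with `cudaDeviceSynchronize()`, which I believe corresponds to `cudaThreadSynchronize()` in the CUDA 1.x runtime. The launch returns control to the host immediately, so any CPU code placed before the synchronization call runs while the GPU is busy.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that just keeps the GPU busy for a while.
__global__ void busy(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 10000; ++k)
            data[i] = data[i] * 0.999f + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = NULL;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // The launch only queues the kernel; the host continues right away.
    busy<<<(n + 255) / 256, 256>>>(d_data, n);

    // Placeholder CPU work that overlaps with the kernel above.
    double acc = 0.0;
    for (int i = 0; i < 1000000; ++i)
        acc += i * 1e-6;

    // Block until the GPU is done (cudaThreadSynchronize() in CUDA 1.x).
    cudaError_t err = cudaDeviceSynchronize();
    printf("CPU result %f, GPU status: %s\n", acc, cudaGetErrorString(err));

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```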

Regarding the crashes, I had major stability problems with the 2.0 beta; however, 2.0 final solved all of those for me. I understand your point about being close to the project deadline, though.

I’ve thought a lot about async operation, but haven’t gone down that road. May I ask what specifically you’re thinking of doing with async? What pitfalls do you see in your mind’s eye? What benefits do you want?

I have some thoughts on it, but they’re too jumbled to share straight out - I do hope that I can be of some help though.

Edit: I can second CUDA 2.0’s stability. Btw, the final version is nice, and 1.1 and 2.0 can coexist on the same machine, too. Still, in your case, with the final project rush, I’d probably also stick with 1.1 :)

There’s a point in my application where I have to initialize a series of transcendental polynomials and perform the first 2D FFT of an image.

Since the polynomial code contains a lot of branches, I don’t think it would gain much from being implemented on the GPU; so I was thinking of doing it on the CPU while at the same time performing the 2D FFT with CUDA FFT.

So something like

  1. Asynchronously launch CUDA FFT

  2. Initialize polynomials on the CPU

  3. Asynchronously transfer the arrays filled with polynomial values to the GPU

  4. Be sure that the FFT is done and that the arrays were transferred

  5. Start the new calculations on the GPU (between the FFTed array and the polynomials array)
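
The five steps above could be sketched roughly as follows. This is only an illustration under several assumptions: the `combine` kernel and `init_polynomials` are hypothetical stand-ins for the real computations, the image is a zeroed placeholder, and the code assumes a toolkit whose cuFFT provides `cufftSetStream` (I am not sure the 1.1 release does). The FFT is queued on one stream, the CPU fills the polynomial array meanwhile, the copy goes out on a second stream from pinned memory, and an event plus a stream synchronization cover step 4.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "CUDA error: %s (line %d)\n",         \
                    cudaGetErrorString(err_), __LINE__);          \
            exit(1);                                              \
        }                                                         \
    } while (0)

// Hypothetical step-5 kernel: scale each FFT coefficient by a polynomial value.
__global__ void combine(cufftComplex *spectrum, const float *poly, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        spectrum[i].x *= poly[i];
        spectrum[i].y *= poly[i];
    }
}

// Hypothetical stand-in for the branchy CPU-side polynomial setup.
static void init_polynomials(float *poly, int n)
{
    for (int i = 0; i < n; ++i)
        poly[i] = (i % 2 == 0) ? 1.0f : 0.5f;
}

int main()
{
    const int nx = 256, ny = 256, n = nx * ny;
    cufftComplex *d_image = NULL;
    float *d_poly = NULL, *h_poly = NULL;

    CHECK(cudaMalloc((void **)&d_image, n * sizeof(cufftComplex)));
    CHECK(cudaMemset(d_image, 0, n * sizeof(cufftComplex)));  // placeholder image
    CHECK(cudaMalloc((void **)&d_poly, n * sizeof(float)));
    // Asynchronous copies need page-locked (pinned) host memory.
    CHECK(cudaMallocHost((void **)&h_poly, n * sizeof(float)));

    cudaStream_t fftStream, copyStream;
    cudaEvent_t copyDone;
    CHECK(cudaStreamCreate(&fftStream));
    CHECK(cudaStreamCreate(&copyStream));
    CHECK(cudaEventCreate(&copyDone));

    cufftHandle plan;
    if (cufftPlan2d(&plan, nx, ny, CUFFT_C2C) != CUFFT_SUCCESS ||
        cufftSetStream(plan, fftStream) != CUFFT_SUCCESS) {
        fprintf(stderr, "cuFFT plan setup failed\n");
        return 1;
    }

    // 1. Queue the in-place 2D FFT on fftStream; the call returns immediately.
    if (cufftExecC2C(plan, d_image, d_image, CUFFT_FORWARD) != CUFFT_SUCCESS) {
        fprintf(stderr, "cuFFT exec failed\n");
        return 1;
    }

    // 2. CPU work that overlaps with the FFT.
    init_polynomials(h_poly, n);

    // 3. Asynchronous host-to-device copy on a second stream.
    CHECK(cudaMemcpyAsync(d_poly, h_poly, n * sizeof(float),
                          cudaMemcpyHostToDevice, copyStream));
    CHECK(cudaEventRecord(copyDone, copyStream));

    // 4. Wait on the host until both the copy and the FFT have finished.
    CHECK(cudaEventSynchronize(copyDone));
    CHECK(cudaStreamSynchronize(fftStream));

    // 5. Combine the FFTed array with the polynomial array.
    combine<<<(n + 255) / 256, 256, 0, fftStream>>>(d_image, d_poly, n);
    CHECK(cudaGetLastError());
    CHECK(cudaStreamSynchronize(fftStream));

    cufftDestroy(plan);
    CHECK(cudaEventDestroy(copyDone));
    CHECK(cudaStreamDestroy(copyStream));
    CHECK(cudaStreamDestroy(fftStream));
    CHECK(cudaFreeHost(h_poly));
    CHECK(cudaFree(d_poly));
    CHECK(cudaFree(d_image));
    printf("done\n");
    return 0;
}
```

Something like `nvcc overlap.cu -lcufft` should build it (the file name is arbitrary). Whether the copy in step 3 really overlaps with the FFT depends on the hardware supporting concurrent copy and execution; the CPU work in step 2 overlaps with the FFT regardless, since the cuFFT call only queues GPU work.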

By looking at samples and at the Programming Guide I think this could be done with streams and events, but that’s it. There’s not much depth about those methods.

I would really need some sort of tutorial about that. :huh:

Thank you!