acc wait ignoring CUDA default stream


I am using cuFFT together with OpenACC (called from Fortran), and I occasionally get incorrect results. The profile trace shows that the OpenACC kernels are launched in “Stream 13” while the cuFFT kernels are launched in the default stream. The two are therefore not synchronized, and sometimes the kernels overlap, producing errors.

First I inserted acc wait directives around the calls to cuFFT (in the FORTRAN code). This had no observable effect.

I then stuck cudaDeviceSynchronize around the cuFFT calls on the C side. The program now produces correct results, but this is obviously not a good solution.


  1. What is the expected behaviour of acc wait with respect to CUDA streams (in particular the default stream)?
  2. What is the correct way to handle this situation?
  3. Could I make OpenACC run on the default stream too?

Versions of things used:


Hi ola,

The best way to do this is to get the stream OpenACC is using and then set cuFFT’s stream to be the same.

integer(kind=8) streamid  ! or kind=4 if in 32-bits
streamid = acc_get_cuda_stream(acc_async_sync)

This call gets the stream id used by OpenACC.

Next you’d create the cufft plan. For example:

call cufftPlan1D(plan,n,CUFFT_Z2Z,1)

At this point, you can tell cuFFT to use the same stream as OpenACC by doing the following:

call cufftSetStream(plan,streamid)
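
Putting the steps together, a minimal sketch might look like the following. It assumes the function-style "cufft" interface module that ships with the PGI compilers (rather than the call-style wrappers shown above), and the program name, array names and size are made up for illustration.

```fortran
! Sketch only: assumes the PGI "openacc" and "cufft" modules are available.
! Names (acc_cufft_same_stream, a, b, n) are illustrative, not from the thread.
program acc_cufft_same_stream
  use openacc
  use cufft
  implicit none
  integer, parameter :: n = 1024
  complex(kind=8) :: a(n), b(n)
  integer :: plan, ierr, i

  ! Create the plan, then bind it to the stream OpenACC uses for
  ! synchronous (no async clause) compute regions.
  ierr = cufftPlan1d(plan, n, CUFFT_Z2Z, 1)
  ierr = ierr + cufftSetStream(plan, acc_get_cuda_stream(acc_async_sync))

  !$acc data create(a) copyout(b)

  ! OpenACC kernel that fills the FFT input on the device.
  !$acc parallel loop
  do i = 1, n
     a(i) = cmplx(1.0d0, 0.0d0, kind=8)
  end do

  ! Hand the device addresses to cuFFT. The transform is queued on the
  ! same stream as the loop above, so it is ordered after that loop.
  !$acc host_data use_device(a, b)
  ierr = ierr + cufftExecZ2Z(plan, a, b, CUFFT_FORWARD)
  !$acc end host_data

  !$acc end data

  ierr = ierr + cufftDestroy(plan)

  ! For an all-ones input, b(1) should be close to n and the rest close to zero.
  print *, 'accumulated cuFFT status:', ierr
  print *, 'b(1) =', b(1)
end program acc_cufft_same_stream
```

Because the plan shares the OpenACC stream, no cudaDeviceSynchronize calls should be needed around the cuFFT call in this sketch.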

Hope this helps,

Thanks. Is there a special number (zero?) to get the default async, or is it necessary to specify an async clause for all operations?

I’d also like to know if this behaviour is documented anywhere, because it took me rather by surprise that the default stream is different in CUDA and OpenACC, and the errors can be fairly subtle. Also this does not appear to be widely known, e.g.,

makes no mention of this issue, nor have I seen it in any of the GTC presentations from NVIDIA that I’ve watched. Surely this must cause problems for any OpenACC/CUDA interop unless explicit streams are used everywhere?

Hi ola,

You can link with “-Mcuda” to have PGI’s OpenACC implementation use stream 0 as the default. Given that the ORNL example would fail to link without it, I’m guessing it’s implied by the “ftn” driver.

However, there are issues with using stream 0, which is why we don’t use it as the default stream. In particular, if there is any activity on stream 0 in between work on other non-zero streams, those other streams can’t run concurrently.

As of CUDA 7.0, nvcc users are able to compile their code with “--default-stream per-thread”. This causes each host thread to have a unique non-zero default stream. In this scenario, having OpenACC simply default to stream 0 may not work. (See:

Because of these two issues, I’ve recently started recommending that users of OpenACC+cuFFT use the method described above rather than “-Mcuda”, since it avoids the problems of using stream 0.

  • Mat