I am using cuFFT together with OpenACC (called from FORTRAN), and I occasionally get incorrect results. The profile trace shows that the OpenACC kernels are launched in “Stream 13”, while the cuFFT kernels are launched in the default stream. The two are therefore not synchronized with each other, and sometimes the kernels overlap, producing errors.
First I inserted acc wait directives around the calls to cuFFT (in the FORTRAN code). This had no observable effect.
I then put cudaDeviceSynchronize calls around the cuFFT calls on the C side. The program now produces correct results, but synchronizing the whole device on every FFT is obviously not a good solution.
- What is the expected behaviour of acc wait with respect to CUDA streams (in particular the default stream)?
- What is the correct way to handle this situation?
- Could I make OpenACC run on the default stream too?
Versions of things used: