Copyout not working and wrong results from cuFFT

I have a problem with the following piece of code.
The program calls a subroutine that computes the FFT of the input using the cuFFT library.
The code compiles successfully, but the output is an empty vector.
If I do everything in the same program (without a subroutine), it works fine.
It looks like a memory synchronization problem.
I compile using:
nvfortran -fast -acc -gpu=managed -Minfo=accel fftx_sub.f90 -o fftx_sub -L/usr/local/cuda/lib64 -lcufft

program test
use, intrinsic :: iso_c_binding
implicit none
integer(c_int), parameter :: nx=32,fpz=33,fpy=32
real(c_double) :: u(nx,fpz,fpy)
complex(c_double_complex) :: uc(nx/2+1,fpz,fpy)
double precision :: dx
integer :: i
dx=6.28/(nx-1)
do i=1,nx
u(i,:,:)=sin((i-1)*dx)
enddo

do i=1,nx
write(*,*) "f(x)", u(i,1,1)
enddo

call computefftx(u,uc)
end program

subroutine computefftx(u,uc)
use, intrinsic :: iso_c_binding
use cufft
use openacc
integer(c_int) :: inembed(3),onembed(3)
integer(c_int) :: nx,npz,npy
integer :: cudaplan_x_fwd,cudaplan_x_bwd
real(c_double) :: u(nx,npz,npy)
complex(c_double_complex) :: uc(nx/2+1,npz,npy)
integer :: istride,ostride,idist,odist
integer :: dims(1),i

nx=32
npz=32
npy=32
inembed=[nx,npz,npy]
onembed=[nx/2+1,npz,npy]
istride=1
ostride=1
idist=nx
odist=nx/2+1
gerr=0
dims(1)=nx

gerr=gerr+cufftPlanMany(cudaplan_x_fwd,1,dims,inembed,istride,idist,onembed,ostride,odist,CUFFT_D2Z,npz*npy)

do i=1,nx
write(*,*) "f(x)", u(i,1,1)
enddo

!$acc data copyin(u) copyout(uc)
gerr=gerr+cufftSetStream(cudaplan_x_fwd,acc_get_cuda_stream(acc_async_sync))
!$acc host_data use_device(u,uc)
gerr=gerr+cufftExecD2Z(cudaplan_x_fwd,u,uc)
!$acc end host_data
!$acc end data
write(*,*) "gerr", gerr

do i=1,nx/2
write(*,*) "spectral f(x)", uc(i,1,1)
enddo

return
end

Disclaimer: Not too familiar with Fortran.

Did you synchronize the host thread (i.e., ensure kernel was finished and data copied back to host) before reading?

Yes, I'm copying out, and I read the results on the host.
I double-checked this, and the result is 0 both on the device and on the host.
Could it be that the subroutine only receives a pointer to the input rather than a full copy of it, and is therefore unable to move it to the device?
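
On the pointer question: Fortran compilers typically pass arrays by reference whether or not the subroutine sits in a module, so the subroutine does see the caller's data. What it also needs is the array shape at entry. In the posted subroutine, nx, npz and npy are local variables that are assigned only after u and uc are declared with them, so the shape may be undefined when the data clause runs. This is an assumption, not something confirmed in this thread. A minimal sketch in plain Fortran (no GPU needed) of passing the extents as dummy arguments, with hypothetical names shape_demo and report:

```fortran
module shape_demo
  implicit none
contains
  ! n1, n2, n3 are dummy arguments, so the shape of u is defined on entry.
  subroutine report(u, n1, n2, n3)
    integer, intent(in) :: n1, n2, n3
    double precision, intent(in) :: u(n1, n2, n3)
    ! size(u) is what a whole-array data clause such as copyin(u) would rely on.
    write(*,*) "elements visible in subroutine:", size(u)
  end subroutine report
end module shape_demo

program demo
  use shape_demo
  implicit none
  integer, parameter :: nx = 32, fpz = 33, fpy = 32
  double precision :: u(nx, fpz, fpy)
  u = 1.0d0
  ! Pass the actual extents so the subroutine never guesses the shape.
  call report(u, nx, fpz, fpy)
end program demo
```

Because the subroutine is in a module, the compiler also checks the call against the interface, which catches a mismatched argument list at compile time.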

Can you comment out the SetStream and see what happens?

gerr=gerr+cufftSetStream(cudaplan_x_fwd,acc_get_cuda_stream(acc_async_sync))

I think OpenACC and cuFFT are not on the same stream.
More info: How to use CUDA stream in OpenACC - #4 by MatColgrove

Dear mnicely,

I get the same results commenting out the stream.
However, I found out that if I put this subroutine inside a module then I get the correct results.
Is it maybe related to how arrays are passed to the subroutine?
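
For reference, here is a sketch of what the module version could look like. It uses the same cuFFT and OpenACC calls as the original post; the module name fft_mod, the extra extent arguments, and the cufftDestroy call at the end are additions made for illustration. Passing the extents as arguments is an assumption about what makes the data clauses see the right sizes; the thread does not confirm the root cause.

```fortran
module fft_mod
  use, intrinsic :: iso_c_binding
  use cufft
  use openacc
  implicit none
contains
  ! Batched 1D real-to-complex FFT along the first dimension of u.
  subroutine computefftx(u, uc, nx, npz, npy)
    integer(c_int), intent(in) :: nx, npz, npy   ! extents supplied by the caller
    real(c_double) :: u(nx, npz, npy)
    complex(c_double_complex) :: uc(nx/2+1, npz, npy)
    integer(c_int) :: inembed(3), onembed(3), dims(1)
    integer :: plan, gerr, istride, ostride, idist, odist, i

    inembed = [nx, npz, npy]
    onembed = [nx/2+1, npz, npy]
    istride = 1
    ostride = 1
    idist = nx          ! distance between consecutive input signals
    odist = nx/2+1      ! distance between consecutive output signals
    dims(1) = nx
    gerr = 0

    gerr = gerr + cufftPlanMany(plan, 1, dims, inembed, istride, idist, &
                                onembed, ostride, odist, CUFFT_D2Z, npz*npy)

    !$acc data copyin(u) copyout(uc)
    ! Run cuFFT on the stream OpenACC uses for synchronous work.
    gerr = gerr + cufftSetStream(plan, acc_get_cuda_stream(acc_async_sync))
    !$acc host_data use_device(u, uc)
    gerr = gerr + cufftExecD2Z(plan, u, uc)
    !$acc end host_data
    !$acc end data

    gerr = gerr + cufftDestroy(plan)
    write(*,*) "gerr", gerr

    do i = 1, nx/2
      write(*,*) "spectral f(x)", uc(i,1,1)
    enddo
  end subroutine computefftx
end module fft_mod
```

The main program would then add use fft_mod and call computefftx(u,uc,nx,fpz,fpy). Note that the original program declares fpz=33 while the subroutine hard-codes npz=32; passing the caller's extents removes that mismatch as well.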
