I am just getting started with nvfortran and cuFFT, so my question may be an easy one; I certainly hope so. The test code below is based on the example here:
https://docs.nvidia.com/hpc-sdk/compilers/fortran-cuda-interfaces/index.html#cflib-fft-cuf-host
but adjusted to run a double-precision complex-to-complex forward transform. I compile it with:
nvfortran -cudalib=cufft -Wl,-rpath,/<path to nvhpc/23.9 library> -o test_fft_1D.x test_fft_1D.cuf
on a shared supercomputer with Tesla P100 cards running CentOS Linux 7. The -rpath option is needed so that the correct runtime libraries are found.
The program runs, but the norm of the transform output is 0. In fact, it seems that the output array “d” is never written at all. What am I missing?
I ran nvprof and included the output below. Data are clearly transferred from host to device, but I see no transfer back to the host, nor any kernel or API call that appears to execute the transform.
_________ code _________
program test_fft_1D
use cudafor
use cufft
implicit none
integer, parameter :: n=2**20
integer :: fft_ierr,i,plan
double precision :: a(n),b(n),err
complex(kind=8), managed :: c(n),d(n)
complex(kind=8), parameter :: AI=(0.d0,1.d0)
! Initialize CUDA FFT:
fft_ierr = cufftPlan1D(plan,n,CUFFT_C2C,1)
if (fft_ierr .ne. 0) then
  print *,'Trouble with cufftPlan1D, returns flag ',fft_ierr
end if
! Set random input vector:
call random_number(a)
call random_number(b)
c = a + AI * b
print *,'Norm of input vector is ',sqrt(sum(c * conjg(c)))
! Compute the forward transform:
fft_ierr = cufftExecZ2Z(plan,c,d,CUFFT_FORWARD)
! Synchronize device/host memory:
fft_ierr = cudaDeviceSynchronize()
! Investigate output vector:
print *,'Norm of output vector is ',sqrt(sum(d * conjg(d)))
! Destroy plan and exit:
fft_ierr = cufftDestroy(plan)
end program test_fft_1D
________ result _________
Norm of input vector is (836.0109658087907,0.000000000000000)
Norm of output vector is (0.000000000000000,0.000000000000000)
_____ nvprof output _____
==20546== NVPROF is profiling process 20546, command: ./test_fft_1D.x
Norm of input vector is (836.0109658087907,0.000000000000000)
Norm of output vector is (0.000000000000000,0.000000000000000)
==20546== Profiling application: ./test_fft_1D.x
==20546== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 9.6275ms 4 2.4069ms 640ns 5.0542ms [CUDA memcpy HtoD]
API calls: 88.98% 130.46ms 2 65.228ms 53.748us 130.40ms cudaMallocManaged
3.75% 5.5001ms 3 1.8334ms 16.358us 5.4479ms cuMemcpyHtoD
3.65% 5.3441ms 1 5.3441ms 5.3441ms 5.3441ms cudaMemcpy
1.68% 2.4690ms 2 1.2345ms 821.80us 1.6472ms cudaFree
0.43% 625.91us 2 312.95us 235.86us 390.05us cuModuleLoadData
0.36% 527.63us 231 2.2840us 199ns 120.79us cuDeviceGetAttribute
0.31% 454.34us 2 227.17us 225.30us 229.04us cuMemAlloc
0.30% 446.17us 2 223.09us 149.19us 296.98us cuMemFree
0.17% 249.29us 390 639ns 394ns 4.2500us cuGetProcAddress
0.13% 185.52us 1 185.52us 185.52us 185.52us cudaSetDevice
0.07% 106.91us 2 53.455us 35.200us 71.710us cuModuleUnload
0.03% 51.175us 3 17.058us 12.950us 23.534us cuDeviceGetName
0.02% 35.429us 1 35.429us 35.429us 35.429us cuMemGetInfo
0.02% 30.377us 1 30.377us 30.377us 30.377us cudaDeviceSynchronize
0.01% 15.376us 3 5.1250us 1.9180us 8.6400us cudaGetDevice
0.01% 15.126us 20 756ns 388ns 5.0420us cuFuncGetAttribute
0.01% 13.401us 1 13.401us 13.401us 13.401us cuPointerGetAttribute
0.01% 12.850us 2 6.4250us 1.6440us 11.206us cuModuleGetGlobal
0.01% 12.138us 2 6.0690us 1.4370us 10.701us cuOccupancyMaxActiveBlocksPerMultiprocessor
0.01% 11.630us 2 5.8150us 2.0950us 9.5350us cuModuleGetFunction
0.01% 10.065us 7 1.4370us 345ns 4.5140us cuCtxPushCurrent
0.00% 6.0340us 1 6.0340us 6.0340us 6.0340us cuInit
0.00% 5.6550us 1 5.6550us 5.6550us 5.6550us cuDeviceGetPCIBusId
0.00% 5.0270us 7 718ns 294ns 1.0490us cuCtxPopCurrent
0.00% 4.6240us 4 1.1560us 348ns 3.0910us cuDeviceGetCount
0.00% 3.9840us 2 1.9920us 1.0800us 2.9040us cuDeviceTotalMem
0.00% 3.3600us 6 560ns 374ns 715ns cuCtxGetCurrent
0.00% 2.6510us 1 2.6510us 2.6510us 2.6510us cuPointerGetAttributes
0.00% 1.5800us 3 526ns 277ns 852ns cuDeviceGet
0.00% 1.4320us 2 716ns 670ns 762ns cuModuleGetLoadingMode
0.00% 1.1530us 2 576ns 525ns 628ns cuDeviceGetUuid
0.00% 544ns 1 544ns 544ns 544ns cuDriverGetVersion
0.00% 398ns 1 398ns 398ns 398ns cuCtxGetDevice
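
The profile above shows no FFT kernel at all, which would be consistent with the execute call returning an error before anything is launched. The test code never checks the value returned by cufftExecZ2Z, so such an error would go unnoticed. Below is a minimal diagnostic sketch, not a confirmed fix. It rests on one assumption: that the plan type has to match the execute routine, so the plan is created with CUFFT_Z2Z (double precision) instead of CUFFT_C2C (single precision). The program name check_fft_status is hypothetical; everything else follows the original test.

```fortran
program check_fft_status
  use cudafor
  use cufft
  implicit none
  integer, parameter :: n = 2**20
  integer :: ierr, plan
  double precision :: a(n), b(n)
  complex(kind=8), managed :: c(n), d(n)

  ! Assumption: a double-precision plan (CUFFT_Z2Z) is required to match
  ! cufftExecZ2Z. The original test created the plan with CUFFT_C2C.
  ierr = cufftPlan1D(plan, n, CUFFT_Z2Z, 1)
  if (ierr /= 0) print *, 'cufftPlan1D returned ', ierr

  ! Random complex input, zeroed output
  call random_number(a)
  call random_number(b)
  c = cmplx(a, b, kind=8)
  d = (0.d0, 0.d0)

  ! Check the status of the transform itself, which the original test ignored
  ierr = cufftExecZ2Z(plan, c, d, CUFFT_FORWARD)
  if (ierr /= 0) print *, 'cufftExecZ2Z returned ', ierr

  ! Wait for the GPU before touching managed data on the host
  ierr = cudaDeviceSynchronize()
  if (ierr /= 0) print *, 'cudaDeviceSynchronize returned ', ierr

  print *, 'Norm of input vector is  ', sqrt(real(sum(c * conjg(c))))
  print *, 'Norm of output vector is ', sqrt(real(sum(d * conjg(d))))

  ierr = cufftDestroy(plan)
end program check_fft_status
```

If the assumption holds, no nonzero status should be printed. Because cuFFT transforms are unnormalized, the output norm would then be expected to be about sqrt(n) times the input norm. Running the same status check against the original CUFFT_C2C plan should show whether cufftExecZ2Z is rejecting the plan.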