CUDA Fortran and PGI Accelerator mix

Greetings. Is mixing CUDA Fortran (-Mcuda) with the PGI Accelerator Model (-ta=nvidia) supported? I saw a post from April 2010 saying that they shouldn’t be used together, but that at some point they might be.


Hi BL,

Yes, they are now supported together. At one point they used different CUDA APIs, but we have since merged them so that they are now compatible on all platforms (it did work on Linux before, but not on Windows). Note that the accelerator directives do recognize CUDA Fortran device variables, so don’t copy these variables. We also added a “!$CUF” directive to CUDA Fortran, which is essentially a ‘lite’ version of the PGI Accelerator model. It does not automate data movement but does create device kernels for you. It also uses the CUDA chevron syntax to give you control of the loop schedule.
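As a rough illustration of the “!$CUF” directive described above, here is a minimal sketch of a CUF kernel. The program name, array names, and size are made up for this example. The `<<<*,*>>>` form asks the compiler to pick the launch configuration; explicit grid and block values can be given instead to control the schedule. Because the directive does not move data, the host/device copies are written out as plain assignments.

```fortran
program cuf_kernel_demo
  use cudafor
  implicit none
  integer, parameter :: n = 1024      ! illustrative size
  real :: a(n), b(n)                  ! host arrays
  real, device :: a_d(n), b_d(n)      ! CUDA Fortran device arrays
  integer :: i

  a = 1.0
  a_d = a        ! explicit host-to-device copy (not automated by !$cuf)

  ! The compiler turns this loop into a device kernel.
  !$cuf kernel do <<< *, * >>>
  do i = 1, n
     b_d(i) = 2.0 * a_d(i)
  end do

  b = b_d        ! explicit device-to-host copy

  ! Largest deviation from the expected value 2.0;
  ! this should be zero if the kernel ran correctly.
  print *, maxval(abs(b - 2.0))
end program cuf_kernel_demo
```

This sketch should build with the same -Mcuda flag mentioned above; no -ta=nvidia is needed for the CUF directive itself.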

Hope this helps,

Thanks! That’s great to hear.
I tried to test mixing CUDA Fortran and PGI Accelerator directives. The code below compiles fine, but I get an error at runtime. I’m using Windows.

program fft_test
use cudafor
use precision
use cufft
complex(fp_kind) ,allocatable:: a(:),b(:),c(:)
complex(fp_kind),device,allocatable:: a_d(:),b_d(:)
integer:: n
integer:: plan

n=8

! allocate arrays on the host
allocate (a(n),b(n),c(n))

! allocate arrays on the device
allocate (a_d(n))
allocate (b_d(n))

!initialize arrays on host
a=1

!copy arrays to device
a_d=a

! Print initial array
print *, "Array A:"
print *, a

! Initialize the plan
call cufftPlan1D(plan,n,CUFFT_Z2Z,1)

! Execute FFTs
call cufftExecZ2Z(plan,a_d,b_d,CUFFT_FORWARD)

!call cufftExecZ2Z(plan,b_d,b_d,CUFFT_INVERSE)

! Copy results back to host
b=b_d

! Print transformed array
print *, "Array B"
print *, b

! Add arrays
!$acc region
do j=1,n
c(j)=a(j)+b(j)
end do
!$acc end region
print *, "Array C"
print *, c

!release memory on the host
deallocate (a,b,c)

!release memory on the device
deallocate (a_d,b_d)

! Destroy the plan
call cufftDestroy(plan)

end program fft_test

This is the compile output

>pgf90 precision.f90 cufft.f90 fft_test.f90 -o main -Mcuda=cuda3.2 -ta=nvidia:cuda3.2 -Minfo "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\lib\x64\cufft.lib"
     20, Memory set idiom, array assignment replaced by call to pgf90_msetz16
         Memory zero idiom, array assignment replaced by call to pgf90_mzeroz16
     49, Generating copyin(b(1:8))
         Generating copyin(a(1:8))
         Generating copyout(c(1:8))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     50, Loop is parallelizable
         Accelerator kernel generated
         50, !$acc do parallel, vector(8) ! blockidx%x threadidx%x
             CC 1.3 : 11 registers; 52 shared, 4 constant, 0 local memory bytes; 25% occupancy
             CC 2.0 : 18 registers; 4 shared, 64 constant, 0 local memory bytes; 16% occupancy

This is the output

 Array A:
 (1.000000000000000,0.000000000000000)  (1.000000000000000,0.000000000000000)
 (1.000000000000000,0.000000000000000)  (1.000000000000000,0.000000000000000)
 (1.000000000000000,0.000000000000000)  (1.000000000000000,0.000000000000000)
 (1.000000000000000,0.000000000000000)  (1.000000000000000,0.000000000000000)
 Array B
 (8.000000000000000,0.000000000000000)  (0.000000000000000,0.000000000000000)
 (0.000000000000000,0.000000000000000)  (0.000000000000000,0.000000000000000)
 (0.000000000000000,0.000000000000000)  (0.000000000000000,0.000000000000000)
 (0.000000000000000,0.000000000000000)  (0.000000000000000,0.000000000000000)
call to cuMemAlloc returned error 201: Invalid context
CUDA driver version: 3020

This error seems to occur when the accelerated region is entered. What does this error mean? I first compiled without specifying cuda3.2 and thought that was causing a mismatch, but specifying it made no difference.


Hi BL,

You’re missing an interface for the CUFFT routines. Without an interface, the compiler must treat the calls using F77 calling semantics which are incorrect here.

Take a look at this article from the latest PGInsider, which shows how to call the CUBLAS, CULA, and Magma BLAS libraries. The same methods can be used to call CUFFT.
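For illustration, an explicit interface for the three CUFFT calls used in the posted program might look roughly like the sketch below. The module names `precision` and `cufft` and the kind name `fp_kind` are taken from the posted code; the dummy argument names are made up, the constant values are those defined in cufft.h, and the routines are declared as subroutines only to match the `call` statements in the post (the C functions actually return a status code, which this sketch ignores).

```fortran
module precision
  ! Double precision, matching the Z2Z (double complex) transforms
  integer, parameter :: fp_kind = kind(0.0d0)
end module precision

module cufft
  ! Constants as defined in cufft.h
  integer, parameter :: CUFFT_FORWARD = -1
  integer, parameter :: CUFFT_INVERSE = 1
  integer, parameter :: CUFFT_Z2Z = 105     ! 0x69

  interface
     ! Create a 1D plan; the plan handle is returned by reference
     subroutine cufftPlan1d(plan, nx, fft_type, batch) &
          bind(C, name='cufftPlan1d')
       use iso_c_binding
       integer(c_int) :: plan
       integer(c_int), value :: nx, fft_type, batch
     end subroutine cufftPlan1d

     ! Double-complex transform on device arrays
     subroutine cufftExecZ2Z(plan, idata, odata, direction) &
          bind(C, name='cufftExecZ2Z')
       use iso_c_binding
       use precision
       integer(c_int), value :: plan, direction
       complex(fp_kind), device :: idata(*), odata(*)
     end subroutine cufftExecZ2Z

     ! Release the plan
     subroutine cufftDestroy(plan) bind(C, name='cufftDestroy')
       use iso_c_binding
       integer(c_int), value :: plan
     end subroutine cufftDestroy
  end interface
end module cufft
```

The `value` attributes and the `device` attribute on the array arguments are the parts the compiler cannot infer without an interface, which is why F77-style calls would pass the arguments incorrectly.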

Hope this helps,

Hello Mat,
I am using an interface for the CUFFT library. The output for array b shows that the call to the cufft routine was successful. Array b is the transform of array a. It is when the program enters the !$acc region that I get the error “call to cuMemAlloc returned error 201: Invalid context”.

Is this still due to an interface problem?


Hi BL,

“I am using an interface for the CUFFT library.”

Sorry, I was in a rush yesterday and missed the ‘use cufft’.

“Is this still due to an interface problem?”

Probably not, but I would need to investigate further to determine the actual problem. Let me try to reproduce the error and see what I can find.

- Mat

Hi BL,

It appears that you’re using the example CUDA Fortran code for calling CUFFT found on the CUDA Musing blog. Using the same cufft module and your modified source, I was able to build and run the exe with both CUDA Fortran and the PGI Accelerator model enabled. Unfortunately, the code ran correctly and I did not see the reported error.

Most likely it’s a problem with your CUDA device driver. Do you mind trying to update to the latest version for your device?
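Before and after updating, a small CUDA Fortran program along the lines of the sketch below can show which driver and runtime versions the program actually sees. It assumes a CUDA Fortran release whose `cudafor` module provides `cudaDriverGetVersion`, `cudaRuntimeGetVersion`, and `cudaGetDeviceProperties`; the program and variable names are made up for this example.

```fortran
program check_cuda_versions
  use cudafor
  implicit none
  type(cudaDeviceProp) :: prop
  integer :: istat, driver_version, runtime_version

  ! Version supported by the installed device driver (e.g. 3020 for 3.2)
  istat = cudaDriverGetVersion(driver_version)
  if (istat /= cudaSuccess) print *, 'driver query failed: ', cudaGetErrorString(istat)

  ! Version of the CUDA runtime the program was built against
  istat = cudaRuntimeGetVersion(runtime_version)
  if (istat /= cudaSuccess) print *, 'runtime query failed: ', cudaGetErrorString(istat)

  print *, 'CUDA driver version:  ', driver_version
  print *, 'CUDA runtime version: ', runtime_version

  ! Name and compute capability of device 0
  istat = cudaGetDeviceProperties(prop, 0)
  if (istat == cudaSuccess) then
     print *, 'Device 0: ', trim(prop%name)
     print *, 'Compute capability: ', prop%major, prop%minor
  else
     print *, 'device query failed: ', cudaGetErrorString(istat)
  end if
end program check_cuda_versions
```

A driver version lower than the runtime version would point to the kind of mismatch a driver update is meant to fix.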

- Mat

Thanks Mat. Are you referring to CUDA v4.0?


“Are you referring to CUDA v4.0?”

I meant the latest CUDA 3.2 development drivers. The CUDA 4.0 development drivers should work as well, but they are still in pre-release.

- Mat