Fortran and cuFFT

What is the best way to call the cuFFT functions from an existing fortran program which uses the fftw3 library calls.

The last problem I am having is that the fortran compiler is case-insensitive for the generated function names. I can get rid of the underscore with a compiler option but all functions are lower-case only so they are not similar to the cuFFT library names. There is a fortran wrapper for the cublas library but not for the cuFFT library.
Is there a fortran wrapper for the cuFFT library planed?

OOps!
Forget about that, I was not thinking…
nvcc is still needed

Ok. I did progress a bit on this problem.

However, I have an error message which is not in the latest cuFFT user guide. Maybe it is somewhere else.

In calling cufftPlan3D:

cufft: Failed to find applicable transform

Can someone show me what this error means?

And why does this piece compile properly?

From SDK2.3

from simpleCUFFT:

// CUFFT plan

cufftHandle plan;

cufftSafeCall(cufftPlan1d(&plan, new_size, CUFFT_C2C, 1));

this is too many arguments if I try in my code.

cufftPlan1d(&plan, new_size, CUFFT_C2C, 1)

That is the correct calling sequence for 1D transform.

Silly me. I did not notice the 1d and I am working with 3d.

How about my previous question on the error message?

Are you taking in account that Fortran is passing arguments by reference and C by value?
Take a look at the Fortran example that calls CUDA code in this presentation (http://gpgpu.org/wp/wp-content/uploads/2009/06/03-Toolkit.pdf )

The arguments passing should be OK. The presentation is very short on the F->CUDA. I would have to change the subroutine too much to follow this approach and I don’t know how much ripple effect it would have on the others subs. I think I already modify as much as I could at this stage.

Back to Fortran to Cuda calls

Which is the correct way?

  1. Fortran
    real*8 ::r(0:2,0:3,0:2)
    …call fun ( i,j,k , r)
    Cuda
    void fun_ ( int *i , int *j, int *k, cufftReal *r);
    In this case, the array r is 1d and I have to use cufftPlan1d

Is the elements order correct? Will the fft be OK?

  1. Fortran
    real*8 ::r(0:2,0:2,0:2)
    …call fun ( i,j,k , r)
    Cuda

    void fun_ ( int *i , int *j, int *k, cufftReal r[2][3][2]);
    In that case, the array r is 3d so I can use cufftPlan3d, but I cannot have variables to specify the array sizes. Can I use r in the function?

Do I have to modify the elements order to agree with fortran? will the fft be OK?

On the return form the fft, do I have to modify the array order before going back the the calling fortran program?

And the solution is :

  1. do the host memory allocation on the fortran side using this:

[codebox]module cuda_alloc

interface

! cudaMallocHost

integer (C_INT) function cudaMallocHost(buffer, size) bind(C,name=“cudaMallocHost”)

use iso_c_binding

implicit none

type (C_PTR) :: buffer

integer (C_LONG), value :: size

end function cudaMallocHost

! cudaFreeHost

integer (C_INT) function cudaFreeHost(buffer) bind(C,name=“cudaFreeHost”)

use iso_c_binding

implicit none

type (C_PTR), value :: buffer

end function cudaFreeHost

end interface

end module cuda_alloc

[/codebox]

  1. call the cuFFT program using this subroutine :

[codebox]

subroutine r2c_3d( Nx, Ny, Nz, r, c)

use iso_c_binding

use cuda_alloc

implicit none

integer Nx, Ny, Nz,res

real*8 :: r(0:Nx-1, 0:Ny-1, 0:Nz-1)

complex*16 :: c(0:Nx/2, 0:Ny-1, 0:Nz-1)

complex*16 , parameter :: size1=dcmplx(0.0,0.0)

real*8 , parameter :: size2=dble(0.0)

type(C_PTR) :: cptr_cd, cptr_rd

complex*16 , dimension(:,:,:), pointer :: cd

real*8 , dimension(:,:,:), pointer :: rd

res = cudaMallocHost ( cptr_cd,NxNyNz*sizeof(size1)/2+1 )

call c_f_pointer ( cptr_cd,cd , [Nx/2+1,Ny,Nz] )

res = cudaMallocHost ( cptr_rd, NxNyNz*sizeof(size2) )

call c_f_pointer ( cptr_rd,rd , [Nx,Ny,Nz ] )

rd = r

call fft_3d_r2c( Nx, Ny, Nz, rd, cd)

c= cd

res = cudaFreeHost(cptr_rd)

res = cudaFreeHost(cptr_cd)

end subroutine r2c_3d

subroutine c2r_3d(Nx, Ny, Nz, c, r)

use iso_c_binding

use cuda_alloc

implicit none

integer Nx, Ny, Nz,res

real*8 :: r(0:Nx-1, 0:Ny-1, 0:Nz-1)

complex*16 :: c(0:Nx/2, 0:Ny-1, 0:Nz-1)

complex*16 , parameter :: size1=dcmplx(0.0,0.0)

real*8 , parameter :: size2=dble(0.0)

type(C_PTR) :: cptr_cd, cptr_rd

complex*16 , dimension(:,:,:), pointer :: cd

real*8 , dimension(:,:,:), pointer :: rd

res = cudaMallocHost ( cptr_cd,NxNyNz*sizeof(size1) )

call c_f_pointer ( cptr_cd,cd , [Nx/2+1,Ny,Nz] )

res = cudaMallocHost ( cptr_rd, NxNyNz*sizeof(size2) )

call c_f_pointer ( cptr_rd,rd , [Nx,Ny,Nz ] )

cd = c

call fft_3d_c2r( Nx, Ny, Nz, cd ,rd)

r = rd

res = cudaFreeHost(cptr_rd)

res = cudaFreeHost(cptr_cd)

end subroutine c2r_3d

[/codebox]

3 ) and finally this is the .cu code

[codebox]

#include <stdio.h>

#include <stdlib.h>

#include “cufft.h”

#include “cuda.h”

#include “cutil_inline.h”

extern “C” void fft_3d_r2c_( int *nx ,int *ny, int *nz, cufftDoubleReal *h_a, cufftDoubleComplex *h_b)

{

unsigned long size1 = sizeof(cufftDoubleReal)* (nx)(ny)(*nz); // calulate memory size

    unsigned long size2 = sizeof(cufftDoubleComplex)* (*nx/2+1)*(*ny)*(*nz);                       //  calulate memory size

cufftDoubleReal *d_ic=NULL; // define pointer to device mem input

    cufftDoubleComplex *d_oc=NULL;                                                                 //  define pointer to device mem output

    cufftHandle plan;

cutilSafeCall( cudaMalloc((void **)&d_ic, size1)); // allocate mem on device

cutilSafeCall( cudaMalloc((void **)&d_oc, size2)); // allocate mem on device

cutilSafeCall( cudaMemcpy(d_ic,h_a, size1 , cudaMemcpyHostToDevice)); //copy host array to device

cufftSafeCall( cufftPlan3d(&plan, (*nz),(*ny),(*nx) ,CUFFT_D2Z )); // perform 3d fft complex to complex

cufftSafeCall( cufftExecD2Z(plan, d_ic, d_oc )); // out of place

cutilSafeCall( cudaMemcpy(h_b,d_oc, size2 , cudaMemcpyDeviceToHost)); // copy back to host array

cudaFree(d_ic);

    cudaFree(d_oc);

    cufftDestroy(plan);

}

extern “C” void fft_3d_c2r_( int* nx ,int *ny, int *nz, cufftDoubleComplex *h_a, cufftDoubleReal *h_b)

{

    unsigned long size1 = sizeof(cufftDoubleComplex)* (*nx)*(*ny)*(*nz);                           //  calulate memory size

    unsigned long size2 = sizeof(cufftDoubleReal)* (*nx)*(*ny)*(*nz);                              //  calulate memory size

    cufftDoubleComplex *d_ic=NULL;                                                                 //  define pointer to device mem input

    cufftDoubleReal *d_oc=NULL;                                                                    //  define pointer to device mem output

    cufftHandle plan;

cutilSafeCall( cudaMalloc((void **)&d_ic, size1)); // allocate mem on device

cutilSafeCall( cudaMalloc((void **)&d_oc, size2)); // allocate mem on device

cutilSafeCall( cudaMemcpy(d_ic,h_a, size1 , cudaMemcpyHostToDevice)); //copy host array to device

cufftSafeCall( cufftPlan3d(&plan, *nz,*ny,*nx ,CUFFT_Z2D )); // perform 3d fft complex to complex

cufftSafeCall( cufftExecZ2D(plan, d_ic, d_oc )); // out of place

cutilSafeCall( cudaMemcpy(h_b,d_oc, size2 , cudaMemcpyDeviceToHost)); // copy back to host array

cudaFree(d_ic);

    cudaFree(d_oc);

    cufftDestroy(plan);

}

[/codebox]

Note that only the latest gnufortran ( after aug , 2009) would work properly.

It is critical to swap the indices nx and nz in cufftPlan3d.