cufft/cublas wrappers for Fortran: how to create cufft wrappers?

Hello everyone.

I use Fortran (Intel Fortran Compiler) for my scientific computations and recently I have started using CUDA in order to speed them up. I don’t have any problems with cublas, but in order to feed gemm with data I need to perform some Fourier transforms. The next logical step is to calculate all the Fourier transforms on the GPU. Unfortunately there are no Fortran wrappers for cufft provided with the CUDA Toolkit. I have tried to write something similar to the cublas Fortran wrappers, but I failed, since I have very little knowledge of “C”.

So, first I have a question about the cublas memory allocation and data copying functions (cublas_alloc, cublas_(s/g)et_vector). Would it be possible to use them to feed data to the cufft functions? (I think it would not be a problem, but I’d like to hear confirmation from somebody more experienced.)

Also, maybe someone has already tried to write wrappers of this kind and can share them? All I need is cufft_plan1d, cufft_destroy and cufftExecZ2Z. For simplicity I would like to use them together with the cublas Fortran wrappers.

As I said, I have tried to do this on my own, but to be honest I don’t know how to handle the “plan” (cufftHandle type). In the code below I have left out almost all the parts concerning the plan (so as not to leave blanks I’ve put “cufftHandle_plan” in those places), since everything I’ve tried was obviously wrong and ended in SIGSEGV.


int CUFFT_EXECZ2Z (_cufftHandle_plan_, const devptr_t *idata, devptr_t *odata, const int *direction)
{
    cuComplex *i = (cuComplex *)(*idata);
    cuComplex *o = (cuComplex *)(*odata);
    return (int) cufftExecZ2Z(_cufftHandle_plan_, i, o, *direction);
}

int CUFFT_PLAN1D (_cufftHandle_plan_, const int *nx, const int *type, const int *count)
{
    return (int) cufftPlan1d(_cufftHandle_plan_, *nx, *type, *count);
}

int CUFFT_DESTROY (unsigned int *plan)
{
    return (int) cufftDestroy (plan);
}

int CUFFT_PLAN1D (unsigned *plan, const int *nx, const int *type, const int *count);

int CUFFT_DESTROY (const unsigned int *plan);

int CUFFT_EXECZ2Z (const unsigned int *plan, const devptr_t *idata, devptr_t *odata, const int *direction);

I would be grateful for any help, but please take into account in your explanations that I’m not very familiar with “C”.

I wrote a CUFFT wrapper for CUDA Fortran some time ago, but you should be able to adapt it to the Intel compiler:

Thank you for the reply. I tried to compile your wrapper before I started this topic, but the allocation of device memory there is compiler specific, so I would have to make many modifications. Instead I wrote a single “C” function to which I pass 1D input/output arrays (the input array contains N functions to transform) and call the cufft routines from there. Unfortunately my data are zero-padded (about 8 times more zeros than actual data), so the calculations and memory transfers take quite a long time. Furthermore, I need only part of the output, so calculating the Fourier transform from the definition (which I perform with gemm) is two times faster in my case (but still slower than using an FFT on the CPU).

As far as I can see, FFT routines that do not need zero-padded input in order to achieve finer sampling in the frequency domain are almost nonexistent…

Don’t transfer the zeros; add the padding on the GPU instead.

If you have N functions to transform, each one of length M, stored on the CPU in a matrix A_cpu(M,N), you can define a matrix A_gpu(Mp,N) on the GPU, where Mp is the new padded length.

You can either:

  1. zero A_gpu and use cudaMemcpy2D to copy A_cpu into A_gpu, since you can define a stride, or

  2. transfer A_cpu to a temporary array on the GPU and then use a custom kernel to fill A_gpu.

Once you have A_gpu, use the standard CUFFT and then transfer the results back (if you need the full range use cudaMemcpy; otherwise you can select a subset with cudaMemcpy2D).
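The pitch arithmetic of the first option can be checked on the host without a GPU. The sketch below is a stand-in only: copy_2d and pad_columns are hypothetical helpers, and the elements are plain doubles rather than the double-complex data a Z2Z transform would use. copy_2d takes its arguments in the same order as cudaMemcpy2D(dst, dpitch, src, spitch, width, height, kind), minus the kind argument, with pitches and width in bytes. On the device, the calloc would correspond to cudaMalloc followed by cudaMemset to zero, and the copy to a cudaMemcpy2D call with cudaMemcpyHostToDevice.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Host-only stand-in for a pitched 2D copy. Pitches and width are in bytes,
 * height is the number of rows copied. With Fortran column-major storage,
 * each "row" in this sense is one column of the matrix. */
void copy_2d(void *dst, size_t dpitch, const void *src, size_t spitch,
             size_t width, size_t height)
{
    assert(width <= dpitch && width <= spitch);
    for (size_t r = 0; r < height; ++r)
        memcpy((char *)dst + r * dpitch, (const char *)src + r * spitch, width);
}

/* Build the zero-padded matrix A_pad(Mp,N) from A(M,N), both column-major.
 * Returns NULL on allocation failure; the caller frees the result. */
double *pad_columns(const double *A, size_t M, size_t N, size_t Mp)
{
    assert(Mp >= M);
    /* calloc plays the role of allocating and zeroing A_gpu. */
    double *A_pad = calloc(Mp * N, sizeof *A_pad);
    if (A_pad != NULL)
        copy_2d(A_pad, Mp * sizeof *A_pad,   /* destination pitch: padded column */
                A, M * sizeof *A,            /* source pitch: original column    */
                M * sizeof *A, N);           /* copy M elements, N times         */
    return A_pad;
}
```

Copying back only the first K entries of each transformed column is the same call with the roles swapped: source pitch Mp elements, destination pitch K elements, width K elements, height N.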