Trouble with cuFFT on multiple GPUs


I am trying to use GPUs for direct numerical simulation of fluid flow, and one of the things I need to accomplish is a 3D FFT of a large set of data (1024^3 hopefully). I have worked with cuFFT quite a bit for smaller cases that fit on a single GPU, but I am now trying to expand the resolution which will require the memory of multiple GPUs.

I have written some sample code (below) to take the forward and inverse FFT of a function as a simple test. I tried to follow the NVidia sample code simplecufft_2d_mgpu as closely as possible for the first test, but instead of creating memory on the host and transferring it to the GPUs, I am using managed memory arrays, but still performing the same memory copies to make sure the data is aligned properly.

I seem to have gotten the code working for the smaller domain sizes, from 64^3 to 256^3, but when I attempt to go to 512^3, or preferably, 1024^3, I get an error in the cuFFTXtSetGPU function.

Usually I have a good idea of what I can do as far as troubleshooting goes, but for this one I am stumped. I am hoping that someone might be able to take my code and run it to see if they have the same issues, or if anyone has come across a similar problem before. I have pasted the code below, but it is also available on my github at Note that the simulation size can be changed in lines 21-23 which define the number of grid points used.

Thanks in advance for the help!

Edit: I forgot to mention that I am using CUDA 8.0 on two different setups: one setup is using a TitanXp as GPU0 and a Quadro K2000 as GPU1. On this setup, I receive the *XtSetGPU error when trying a 512^3 simulation. The other setup is on Amazon Web Services (p2.8xlarge instance) which has two Tesla K80’s. In this case, I can run the 512^3 case, but get the same *XtSetGPU error when attempting to run at 1024^3.

// Multiple GPU version of cuFFT_check that uses multiple GPU's
// This program creates a real-valued 3D function sin(x)*cos(y)*cos(z) and then 
// takes the forward and inverse Fourier Transform, with the necessary scaling included. 
// The output of this process should match the input function

// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <complex.h>

// includes, project
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <cufft.h>
#include <cuComplex.h>
//CUFFT Header file
#include <cufftXt.h>

#define NX 512
#define NY 512
#define NZ 512
#define NZ2 (NZ/2+1)
#define NN (NX*NY*NZ)
#define L (2*M_PI)
#define TX 8
#define TY 8
#define TZ 8

int divUp(int a, int b) { return (a + b - 1) / b; }

int idxClip(int idx, int idxMax){
    return idx > (idxMax - 1) ? (idxMax - 1) : (idx < 0 ? 0 : idx);

int flatten(int col, int row, int stack, int width, int height, int depth){
    return idxClip(stack, depth) + idxClip(row, height)*depth + idxClip(col, width)*depth*height;
    // Note: using column-major indexing format

void initialize(int NX_per_GPU, int gpuNum, cufftDoubleComplex *f1, cufftDoubleComplex *f2)
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const int j = blockIdx.y * blockDim.y + threadIdx.y;
    const int k = blockIdx.z * blockDim.z + threadIdx.z;
    if ((i >= NX_per_GPU) || (j >= NY) || (k >= NZ)) return;
    const int idx = flatten(i, j, k, NX, NY, NZ);

    // Create physical vectors in temporary memory
    double x = i * (double)L / NX + (double)gpuNum*NX_per_GPU*L / NX;
    double y = j * (double)L / NY;
    double z = k * (double)L / NZ;

    // Initialize starting array
    f1[idx].x = sin(x)*cos(y)*cos(z);
    f1[idx].y = 0.0;

    f2[idx].x = 0.0;
    f2[idx].y = 0.0;


void scaleResult(int NX_per_GPU, cufftDoubleComplex *f)
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const int j = blockIdx.y * blockDim.y + threadIdx.y;
    const int k = blockIdx.z * blockDim.z + threadIdx.z;
    if ((i >= NX_per_GPU) || (j >= NY) || (k >= NZ)) return;
    const int idx = flatten(i, j, k, NX, NY, NZ);

    f[idx].x = f[idx].x / ( (double)NN );
    f[idx].y = f[idx].y / ( (double)NN );


int main (void)
    int i, j, k, idx, NX_per_GPU;
    // double complex test;

    // Set GPU's to use and list device properties
    int nGPUs = 2, deviceNum[nGPUs];
    for(i = 0; i<nGPUs; ++i)
        deviceNum[i] = i;


        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, deviceNum[i]);
        printf("  Device name: %s\n",;
        printf("  Memory Clock Rate (KHz): %d\n",
        printf("  Memory Bus Width (bits): %d\n",
        printf("  Peak Memory Bandwidth (GB/s): %f\n\n",

    printf("Running Multi_GPU_FFT_check using %d GPUs on a %dx%dx%d grid.\n",nGPUs,NX,NY,NZ);

    // Initialize input data
    // Split data according to number of GPUs
    NX_per_GPU = NX/nGPUs;              // This is not a good solution long-term; needs more work for arbitrary grid sizes/nGPUs

    // Declare variables
    cufftDoubleComplex *u;
    cufftDoubleComplex *u_fft;

    // Allocate memory for arrays
    cudaMallocManaged(&u, sizeof(cufftDoubleComplex)*NN );
    cudaMallocManaged(&u_fft, sizeof(cufftDoubleComplex)*NN );
    // Launch CUDA kernel to initialize velocity field
    const dim3 blockSize(TX, TY, TZ);
    const dim3 gridSize(divUp(NX_per_GPU, TX), divUp(NY, TY), divUp(NZ, TZ));
    for (i = 0; i<nGPUs; ++i){
        int idx = i*NX_per_GPU*NY*NZ;                // sets the index value of the data to send to each gpu
        initialize<<<gridSize, blockSize>>>(NX_per_GPU, deviceNum[i], &u[idx], &u_fft[idx]);

    // Synchronize both GPUs before moving forward
    for (i = 0; i<nGPUs; ++i){

    // Initialize CUFFT for multiple GPUs //
    // Initialize result variable used for error checking
    cufftResult result;

    // Create empty plan that will be used for the FFT
    cufftHandle plan;
    result = cufftCreate(&plan);
    if (result != CUFFT_SUCCESS) { printf ("*Create failed\n"); return 1; }

    // Tell cuFFT which GPUs to use
    result = cufftXtSetGPUs (plan, nGPUs, deviceNum);
    if (result != CUFFT_SUCCESS) { printf ("*XtSetGPUs failed: code %i\n", result); return 1; }

    // Create the plan for the FFT
    size_t *worksize;                                   // Initializes the worksize variable
    worksize =(size_t*)malloc(sizeof(size_t) * nGPUs);  // Allocates memory for the worksize variable, which tells cufft how many GPUs it has to work with
    // Create the plan for cufft
    result = cufftMakePlan3d(plan, NX, NY, NZ, CUFFT_Z2Z, worksize);
    if (result != CUFFT_SUCCESS) { printf ("*MakePlan* failed: code %d \n",(int)result); exit (EXIT_FAILURE) ; }

    printf("The size of the worksize is %lu\n", worksize[0]);

    // Initialize transform array - to be split among GPU's and transformed in place using cufftX
    cudaLibXtDesc *u_prime;
    // Allocate data on multiple gpus using the cufft routines
    result = cufftXtMalloc(plan, (cudaLibXtDesc **)&u_prime, CUFFT_XT_FORMAT_INPLACE);
    if (result != CUFFT_SUCCESS) { printf ("*XtMalloc failed\n"); exit (EXIT_FAILURE) ; }

    // Copy the data from 'host' to device using cufftXt formatting
    result = cufftXtMemcpy(plan, u_prime, u, CUFFT_COPY_HOST_TO_DEVICE);
    if (result != CUFFT_SUCCESS) { printf ("*XtMemcpy failed, code: %d\n",result); exit (EXIT_FAILURE); }

    // Perform FFT on multiple GPUs
    printf("Forward 3d FFT on multiple GPUs\n");
    result = cufftXtExecDescriptorZ2Z(plan, u_prime, u_prime, CUFFT_FORWARD);
    if (result != CUFFT_SUCCESS) { printf ("*XtExecZ2Z  failed\n"); exit (EXIT_FAILURE); }

////////// Apparently re-ordering the data prior to the IFFT is not necessary (gives incorrect results)////////////////////
    // cudaLibXtDesc *u_reorder;
    // result = cufftXtMalloc(plan, (cudaLibXtDesc **)&u_reorder, CUFFT_XT_FORMAT_INPLACE);
    // if (result != CUFFT_SUCCESS) { printf ("*XtMalloc failed\n"); exit (EXIT_FAILURE) ; }
    // // Re-order data on multiple GPUs to natural order
    // printf("Reordering the data on the GPUs\n");
    // result = cufftXtMemcpy (plan, u_reorder, u_prime, CUFFT_COPY_DEVICE_TO_DEVICE);
    // if (result != CUFFT_SUCCESS) { printf ("*XtMemcpy failed\n"); exit (EXIT_FAILURE); }

    // Perform inverse FFT on multiple GPUs
    printf("Inverse 3d FFT on multiple GPUs\n");
    result = cufftXtExecDescriptorZ2Z(plan, u_prime, u_prime, CUFFT_INVERSE);
    if (result != CUFFT_SUCCESS) { printf ("*XtExecZ2Z  failed\n"); exit (EXIT_FAILURE); }

    // Copy the output data from multiple gpus to the 'host' result variable (automatically reorders the data from output to natural order)
    result = cufftXtMemcpy (plan, u_fft, u_prime, CUFFT_COPY_DEVICE_TO_HOST);
    if (result != CUFFT_SUCCESS) { printf ("*XtMemcpy failed\n"); exit (EXIT_FAILURE); }

    // Scale output to match input (cuFFT does not automatically scale FFT output by 1/N)
    for (i = 0; i<nGPUs; ++i){
        idx = i*NX_per_GPU*NY*NZ;                // sets the index value of the data to send to each gpu
        scaleResult<<<gridSize, blockSize>>>(NX_per_GPU, &u_fft[idx]);

    // Synchronize GPUs
    for (i = 0; i<nGPUs; ++i){

    // Test results to make sure that u = u_fft
    double error = 0.0;
    for (i = 0; i<NX; ++i){
        for (j = 0; j<NY; ++j){
            for (k = 0; k<NZ; ++k){
                idx = k + j*NZ + NZ*NY*i;
                // error += (double)u[idx].x - sin(x)*cos(y)*cos(z);
                error += (double)u[idx].x - (double)u_fft[idx].x;
                // printf("At idx = %d, the value of the error is %f\n",idx,(double)u[idx].x - (double)u_fft[idx].x);
                // printf("At idx = %d, the value of the error is %f\n",idx,error);

    printf("The sum of the error is %4.4g\n",error);

    // Deallocate variables

    // Free malloc'ed variables
    // Free cuda malloc'ed variables
    // Free cufftX malloc'ed variables
    result = cufftXtFree(u_prime);
    if (result != CUFFT_SUCCESS) { printf ("*XtFree failed\n"); exit (EXIT_FAILURE); }
    // result = cufftXtFree(u_reorder);
    // if (result != CUFFT_SUCCESS) { printf ("*XtFree failed\n"); exit (EXIT_FAILURE); }
    // Destroy FFT plan
    result = cufftDestroy(plan);
    if (result != CUFFT_SUCCESS) { printf ("cufftDestroy failed: code %d\n",(int)result); exit (EXIT_FAILURE); }

    return 0;


What is the exact error code returned by cuFFTXtSetGPU and the error description associated with this error code? Surely that should provide some clue? Your description suggests this could be an “out of memory” scenario.

Hi njuffa,

Thanks for the response, and my apologies for not specifying this in the first thread.

The value of the result variable after the error occurs is 5, but to be honest I’m not actually sure which error that is associated with (the documentation in the toolkit lists the names of the errors, but I’m not sure how to find out the numerical value associated with them).

My first thought was that perhaps it was an out of memory issue, but the input for the cufftXtSetGPU function doesn’t seem to rely on the size of the data - the inputs are the pointer for the plan (which has been initialized but not created yet), the number of GPUs and the array of GPU numbers. Perhaps this function reserves some memory on the GPU behind the scenes, which is causing the initialization to fail?

It’s right there in the header file. Function prototype: cufftResult CUFFTAPI cufftXtSetGPUs(cufftHandle handle, int nGPUs, int *whichGPUs); The type cufftResult is an enum, with: CUFFT_INTERNAL_ERROR = 0x5

Internal error is of course an opaque reason. I do not know how CUFFT’s error reporting was designed, I would not expect an internal error to be reported when it finds itself out of memory during an internal allocation. In many cases, internal errors seem worth reporting as a bug. Even if the failure were down to a usage issue, this would indicate that a better diagnostic message is in order, as “internal error” is not actionable information for the programmer.

No argument with anything indicated by njuffa. I haven’t looked at your test case closely.

But one observation I have is that in the cufft documentation:

it states:

“All GPUs must have the same CUDA architecture level.”

So I would say your first test case is invalid.

Thanks for the replies. I realized my mistake on the error reporting - I had incorrectly thought that each API had a separate listing of error codes, without realizing that the error codes were defined for all of cuFFT. Sorry for asking a stupid question!

I was aware that there may be issues using two drastically different GPUs, but was pleasantly surprised that the code worked for everything up to the 512^3 case. That is the primary reason I switched to the second test case with the Tesla K80’s to provide a more solid reference.

I will submit a bug request for this error - thanks for the feedback.

If I might ask a more general question, my hope is to eventually perform FFTs on data that is too large to fit on a single GPU - is this functionality supported by cuFFT? My hope is that by spreading the data over multiple (up to 8) GPUs, I could perform an FFT on data that won’t fit on a single card. I can’t seem to find many examples of people using cuFFT in this manner, however.

Thanks for your time!

I am not aware of any out-of-core FFT capability in CUFFT. Given that GPUs sport 16 GB of on-board memory these days, what use cases exist for FFTs that cannot fit into that amount of memory?

I am trying to investigate the use of GPUs for high resolution Direct Numerical Simulation of fluid flow. This involves the computation and tracking of ~12 variables (such as fluid velocity) through time. For a high resolution case (1024^3), each double precision variable requires ~8.4 GB of memory by my calculation. Ideally, I would like to run a simulation with 2048^3 grid points, which would exceed the standard 16 GB of memory.

These simulations are fairly standard practice, but they are traditionally done on supercomputers using MPI. One of my big questions is to explore the feasibility of performing these calcs on GPUs, because I think the performance increase could be substantial.

Thanks for the explanation. Through a quick literature search, I have since found other examples of FFTs requiring > 16 GB of memory, from seismic analysis and radio-astronomy. I guess there hasn’t been high enough demand for such large FFTs to make it worthwhile for NVIDIA to invest into developing multi-GPU FFT capability. So currently you’d be limited by the 24 GB capacity of a Quadro P6000.

I’m reasonably sure the cufftXt api supports transforms that are larger than what can fit on a single GPU. As njuffa states, there is no “out of core” capability where the transform can exceed the aggregate size of memory on all GPUs combined. Again, referring to the doc:

“The data for the entire transform must fit within the memory of the GPUs assigned to it.”

This doesn’t spell out exactly what the upper limit is for a collection of GPUs, however. If you have a collection of 2 GPUs each with 12GB of memory, you should be able to perform transforms where the data is larger than 12GB but smaller than 24GB, but I can’t give you an upper bound that is better than that.

Forget about cuFFTXt and use MPI.

If you are doing DNS, you can find working GPU codes for isotropic turbulence at:

and for channel flow at:

Thanks txbob for the clarification - I had read that section of the User’s Guide but couldn’t convince myself that it meant the aggregate size of the GPUs in memory, especially after I couldn’t seem to get my sample code working.

mfatica - Thanks for the links to the DNS code. Those codes are very similar to what I am trying to do, so I’m sure that will be useful. I’m not sure if you are familiar with those codes, but I just glanced over it (very briefly) and I don’t understand why MPI is useful here. It seems that it relies on MPI to transfer data between GPUs - but with peer-to-peer access, it seems that doing that without MPI might actually be more efficient? Sorry if that’s a stupid question, just trying to understand better.

If you rely on MPI instead of peer-to-peer, you have a scalable code that can run very large grids ( if I remember correctly we did some 4096^3 simulations with these codes on thousands of GPUs). It is more work, but it pays off. Peer-to-peer can also give suboptimal performances on certain systems. These two codes use a slab decomposition.

There is another channel code we worked on , that uses a pencil decomposition ( book-keeping becomes more complex) that may be of interest to you, that can also simulate Rayleigh-Benard flows. It is in CUDA Fortran ( PGI has now a free community edition compiler) and the source is available here:

The porting is described in this paper (

Thanks for the information, mfatica. That is really interesting!

I have been working under the assumption that the simulations would be running on a single node with many GPUs (up to 16) connected to a single CPU host, so I hadn’t considered using MPI for communication between nodes.