cufftXt batch 1D

I have very large 2D arrays (occupying over 60 GB on disk) on which I have to perform 1D FFTs column by column, and I have at my disposal as many as 8 GPUs connected via PCIe. The size of each transform is small (although not a power of 2); I just have to take a lot of them.

I have tested a program using one GPU in which I sequentially read data, copy host-to-device, execute a batched 1D FFT, copy device-to-host, and write the results to disk, iterating over these steps until I cover the whole file. No issues there and the results are okay. Now I would like to exploit all the GPU resources I have in order to speed up these calculations, so I started looking at different options.
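
In rough outline (variable names and data types are illustrative; plan creation with cufftPlanMany and error checking are omitted), each iteration of that single-GPU loop looks like this:

while (batches_left > 0) {
    // read the next chunk of samples from disk into a pinned host buffer
    fread(h_in, sizeof(cufftComplex), (size_t)nx * batch, fid_in);
    // copy host to device
    cudaMemcpy(d_data, h_in, sizeof(cufftComplex) * nx * batch, cudaMemcpyHostToDevice);
    // batched 1D transform, done in place on the device
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    // copy device to host
    cudaMemcpy(h_out, d_data, sizeof(cufftComplex) * nx * batch, cudaMemcpyDeviceToHost);
    // write the results to disk
    fwrite(h_out, sizeof(cufftComplex), (size_t)nx * batch, fid_out);
    batches_left--;
}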

I am taking the example provided in the documentation to calculate 1D FFTs on multiple GPUs, just slightly modified so that it compiles.

// Demonstrate how to use CUFFT to perform 1-d FFTs using 2 GPUs 
// Output on the GPUs is in natural order
// Function return codes should be checked for errors in actual code
//
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

#include <string.h>
#include <math.h>

// CUDA runtime
#include <cuda_runtime.h>

//CUFFT Header file
#include <cufftXt.h>

int main(){
// cufftCreate() - Create an empty plan
    cufftHandle plan_input; cufftResult result;
    result = cufftCreate(&plan_input);
//
// cufftXtSetGPUs() - Define which GPUs to use
    int nGPUs = 2, whichGPUs[2];
    whichGPUs[0] = 0; whichGPUs[1] = 1;
    result = cufftXtSetGPUs (plan_input, nGPUs, whichGPUs);
//
// Initialize FFT input data
    size_t worksize[2];
    cufftComplex *host_data_input, *host_data_output;
    int nx = 16384, batch = 2;

    int size_of_data = sizeof(cufftComplex) * nx * batch;
    cudaMallocHost((void**)&host_data_input, size_of_data);
    cudaMallocHost((void**)&host_data_output, size_of_data);


    int rank = 1;                           // --- 1D FFTs
    int n[] = { nx };                 // --- Size of the Fourier transform
    int istride = 1, ostride = 1;           // --- Distance between two successive input/output elements
    int idist = nx, odist = (nx); // --- Distance between batches
    int inembed[] = { 0 };                  // --- Input size with pitch (ignored for 1D transforms)
    int onembed[] = { 0 };                  // --- Output size with pitch (ignored for 1D transforms)


    for (int i = 0; i< nx*batch; i++){
        host_data_input[i].x = i*i;
        host_data_input[i].y = i;
        //printf("input host %f \n", host_data_input[i].x);
    }
//cufftResult 
    cufftGetSizeMany(plan_input, rank, n, inembed, istride, idist, 
        onembed, ostride, odist, CUFFT_C2C, batch, worksize);

    printf("Work size area is %lu and %lu\n", worksize[0], worksize[1]);

// cufftMakePlanMany() - Create the plan
    result = cufftMakePlanMany (plan_input, rank, n, inembed, istride, idist,
        onembed, ostride, odist, CUFFT_C2C, batch, worksize);
//
// cufftXtMalloc() - Malloc data on multiple GPUs
    cudaLibXtDesc *device_data_input, *device_data_output;
    result = cufftXtMalloc (plan_input, &device_data_input,
        CUFFT_XT_FORMAT_INPLACE);
    result = cufftXtMalloc (plan_input, &device_data_output,
        CUFFT_XT_FORMAT_INPLACE);
//
// cufftXtMemcpy() - Copy data from host to multiple GPUs
    result = cufftXtMemcpy (plan_input, device_data_input,
        host_data_input, CUFFT_COPY_HOST_TO_DEVICE);
//
// cufftXtExecDescriptorC2C() - Execute FFT on multiple GPUs
    result = cufftXtExecDescriptorC2C (plan_input, device_data_input,
        device_data_input, CUFFT_FORWARD);
//
// cufftXtMemcpy() - Copy the data to natural order on GPUs
    result = cufftXtMemcpy (plan_input, device_data_output,
        device_data_input, CUFFT_COPY_DEVICE_TO_DEVICE);
//
// cufftXtMemcpy() - Copy natural order data from multiple GPUs to host
    result = cufftXtMemcpy (plan_input, host_data_output,
        device_data_output, CUFFT_COPY_DEVICE_TO_HOST);
//
// Print output and check results
    for (int i = 0; i<100; i++){
        printf("host output data i: %d, real %f, complex %f \n", i, host_data_output[i].x, host_data_output[i].y);
    }

//
// cufftXtFree() - Free GPU memory
    result = cufftXtFree(device_data_input);
    result = cufftXtFree(device_data_output);
//
// cufftDestroy() - Destroy FFT plan
    result = cufftDestroy(plan_input);

    return 0;
}

I notice the following:
a) If I increase the batch size to anything greater than 1, it outputs zeros.
b) With batch = 1, if the transform size is not a power of 2, it outputs zeros.

Is there something wrong with my code, or does cufftXt simply not work with FFT batch sizes larger than 1? I have run cuda-memcheck every time and it reports no errors, although the output is zero. Will there be any speed-up in doing batched 1D FFTs across multiple GPUs, or is it not worth exploring?

Thanks!

I don’t have a complete answer, but it looks (from the profiler) like the device-to-device cufftXtMemcpy is not actually happening.

What I observe is that if I copy directly from the in-place buffer to the host results buffer, I get non-zero data:

// cufftXtMemcpy() - Copy natural order data from multiple GPUs to host
    result = cufftXtMemcpy (plan_input, host_data_output,
        device_data_input, CUFFT_COPY_DEVICE_TO_HOST);  //device_data_input instead of device_data_output

For the batch = 2 case with 2 GPUs, it’s evident from the profiler that it is sending one FFT from the batch to each GPU. For large enough FFT work, this should be faster than doing it on a single GPU.

Thanks for your answer, Robert, that did the trick! So far, I’m getting non-zero data whether the transform size is a power of two or not.

I will check how it performs with large enough batches and more than 2 GPUs.

Just as a follow-up: I noticed that in my particular scenario, according to nvvp, it takes about the same time to read my input data from disk (and convert it from uint16_t to float) as it takes to execute the batch, including the data transfers in both directions.

I thought I could have a separate CPU thread reading and preparing the data while the GPU is executing the transform, so that when the GPU has finished, the input data is ready and the GPU can proceed with the next batch, minimizing the idle time between batches.

According to the Linux man page for pthread_create (http://man7.org/linux/man-pages/man3/pthread_create.3.html), if I want to pass some arguments to the thread (i.e. the pointer to the read buffer and the pointer to host_data_input), these need to be encapsulated in a structure. In my case I do

struct thread_data {
    int nx; //transform size
    int batch;      //number of transforms 
    FILE *fid_in;  //input file
    uint16_t* buffer; //read buffer
    cufftComplex *host_data_input;
};

and then I have a read-and-conversion function

void *my_read(void *args){

    thread_data *info = (thread_data *) &args;

    fread(info->buffer, sizeof(uint16_t), info->nx*info->batch, info->fid_in);

    for (int i=0; i<info->batch; i++){
       for (int j=0; j<info->nx; j++){
           info->host_data_input[i*info->nx + j].x = (cufftReal) info->buffer[i*info->nx + j];
           info->host_data_input[i*info->nx + j].y = (cufftReal) 0.f;
        }
    }
        
    return NULL;
}

Reusing the code from my first post and skipping the parts not relevant here, the loop over multiple batched transforms looks like this:

    pthread_t thread_id;                    // reader thread handle
    thread_data data;
    data.batch = batch_max;
    data.fid_in = fid_in;
    data.nx = nx;

// Raw data read buffer allocation
    data.buffer = (uint16_t *)malloc(sizeof(uint16_t)*batch_max*nx);

// cufftCreate() - Create an empty plan
    cufftHandle plan_input; cufftResult result;
    result = cufftCreate(&plan_input);

// cufftXtSetGPUs() - Define which GPUs to use
    int whichGPUs[nGPUs];
    whichGPUs[0] = 0; whichGPUs[1] = 1;
    result = cufftXtSetGPUs (plan_input, nGPUs, whichGPUs);

// Initialize FFT input data
    size_t worksize[nGPUs];
    cufftComplex *host_data_output;

    int size_of_data = sizeof(cufftComplex) * nx * batch_max;
    cudaMallocHost((void**)&data.host_data_input, size_of_data);
    cudaMallocHost((void**)&host_data_output, size_of_data);

    int rank = 1;                           // --- 1D FFTs
    int n[] = { nx };                 // --- Size of the Fourier transform
    int istride = 1, ostride = 1;           // --- Distance between two successive input/output elements
    int idist = nx, odist = (nx); // --- Distance between batches
    int inembed[] = { 0 };                  // --- Input size with pitch (ignored for 1D transforms)
    int onembed[] = { 0 };                  // --- Output size with pitch (ignored for 1D transforms)

//cufftResult 
    cufftGetSizeMany(plan_input, rank, n, inembed, istride, idist, 
        onembed, ostride, odist, CUFFT_C2C, batch_max, worksize);

    for (int i=0; i<nGPUs; i++){
        printf("Work size area is %lu\n", worksize[i]);
    }
    
// cufftMakePlanMany() - Create the plan
    result = cufftMakePlanMany (plan_input, rank, n, inembed, istride, idist,
    onembed, ostride, odist, CUFFT_C2C, batch_max, worksize);
    printf("plan result %d \n", result);

// cufftXtMalloc() - Malloc data on multiple GPUs
    cudaLibXtDesc *device_data_input, *device_data_output;
    result = cufftXtMalloc (plan_input, &device_data_input,
            CUFFT_XT_FORMAT_INPLACE);

    result = cufftXtMalloc (plan_input, &device_data_output,
            CUFFT_XT_FORMAT_INPLACE);

// Init - Read raw data
    int batch = batch_max;
    fread(data.buffer, sizeof(uint16_t), data.nx*data.batch, data.fid_in);

// Convert from 16uint to single-precision float
    for (int i=0; i<batch; i++){
        for (int j=0; j<nx; j++){
            data.host_data_input[i*nx + j].x = (cufftReal) data.buffer[i*nx + j];
            data.host_data_input[i*nx + j].y = (cufftReal) 0.f;
        }
    }

    for (int ibatch = 0; ibatch < nbatch; ibatch++){ 
    //nbatch calculated elsewhere to cover the entire file opened in fid_in

        if(ibatch==nbatch-1){
        //adjust batch size if this is the last run of batches; sz is the total input file size in bytes
            long sz_current = ftell(fid_in);
            batch = (sz-sz_current)/2/nx;   // two bytes per uint16_t sample
        }

       //if not the first iteration, wait for the reader thread to finish
        if (ibatch > 0){
            printf("Waiting for thread \n");
            pthread_join(thread_id, NULL);
        }
        
        // cufftXtMemcpy() - Copy data from host to multiple GPUs
        result = cufftXtMemcpy (plan_input, device_data_input,
            data.host_data_input, CUFFT_COPY_HOST_TO_DEVICE);
        //
        //
        cudaDeviceSynchronize(); //wait for the copy to finish before overwriting the host input buffer


        pthread_create(&thread_id, NULL, my_read, (void *) &data);



        // cufftXtExecDescriptorC2C() - Execute FFT on multiple GPUs
        result = cufftXtExecDescriptorC2C (plan_input, device_data_input,
            device_data_input, CUFFT_FORWARD);

        // cufftXtMemcpy() - Copy natural order data from multiple GPUs to host
        result = cufftXtMemcpy (plan_input, host_data_output,
            device_data_input, CUFFT_COPY_DEVICE_TO_HOST);
        //

        printf("Copying done\n");
        // Print output and check results
        for (int i = 0; i<10; i++){
            printf("host output data i: %d, real %f, complex %f \n", i, host_data_output[i].x, host_data_output[i].y);
        }

        //fwrite(host_data_output, sizeof(cufftComplex), (nx/2 + 1) * batch, fid_out);

}

At runtime I get the following error when the execution reaches my_read:

========= Error: process didn’t terminate successfully
========= The application may have hit an error when dereferencing Unified Memory from the host. Please rerun the application under cuda-gdb or Nsight Eclipse Edition to catch host side errors.
========= No CUDA-MEMCHECK results found

I was somewhat expecting this based on a few other errors I had encountered before. Is there a way to pass host_data_input to the thread without causing this error? Is there any other approach to decoupling the disk I/O from the actual GPU execution, so that the GPU is busy most of the time, or at least so that the data is ready when the GPU has finished?

Thanks

I’m pretty sure the ampersand here is incorrect:

thread_data *info = (thread_data *) &args;
                                    ^

You have something that is already a pointer, you take the address of it (creating a double pointer), and then cast that back to a single-pointer type. That is unlikely to be correct in C or C++ programming no matter what your use case is.
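
In other words, the cast should just use the pointer you were given, e.g.:

thread_data *info = (thread_data *) args;  // args already points to the thread_data struct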

This has nothing to do with CUDA programming. You might want to study pthread example codes. Or rip all the CUDA out of your code, and just get the pthread stuff working.

I am aware that this deviates slightly from CUDA, but my underlying question is still related and is basically: how can I pass the host buffer to another thread, without getting the dereferencing error, in order to parallelize data reading and execution? Most of what is out there is about overlapping memory copies with kernel executions, but not much about feeding the buffers and, subsequently, the kernels.

After further reading, I came across this article: [url]https://devblogs.nvidia.com/cuda-pro-tip-use-cufft-callbacks-custom-data-processing/[/url]. Since it somewhat touches on what I am trying to achieve, I thought it would be worth giving it a try. Before diving into it, I still have a question about how to deal with multi-GPU FFTs and callbacks.

The documentation states: “For multi-GPU transforms, the index passed to the callback routine is the element index from the start of data on that GPU, not from the start of the entire input or output data array.” How can I work out the index relative to the start of the data on each GPU if I don’t know exactly how the data is distributed among the GPUs?

Are the cudaMemcpyFromSymbol (given that I’m not using unified memory, I assume I can’t use managed memory in my cuFFT callback) and cufftXtSetCallback calls enough to make sure the load callback is run on every device? Maybe a code sample would be helpful, if possible.

cufft callbacks have nothing to do with loading data to/from disk, and will not help you with loading data to/from disk, nor will they help with overlap of FFT processing and disk data loading.

On a fast GPU, FFT processing can proceed at a pace that is measured in GB/s. Therefore, if you are streaming data from an “ordinary” disk interface that delivers data at some lower rate than the processing speed of your GPU, this will be the limiting factor in how fast you can process FFTs.

To make sure that you can process FFTs as fast as the disk can deliver the data, it should only be necessary to overlap FFT processing and disk loading, and for that simple case we need not use threads. CUDA activity can proceed asynchronously to host activity, so it is only necessary to issue the work correctly. Your original work issuance scheme:

read data, copy htod, execute batch 1d fft, copy dtoh and write results on disk

if processed like that in a loop, would not allow overlap of asynchronous CPU (data loading) and GPU (FFT) activity.

If we use a double-buffered scheme, we can decouple the two activities and overlap them, without the use of threads. Here is a worked example:

$ cat t1528.cpp
#include <cufft.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <fstream>      // std::ifstream
#include <cuda_runtime_api.h>
#include <cuda_profiler_api.h>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

#define N 1024

int main()
{
    float *r1, *r2, *temp;
    cudaHostAlloc((void **)&r1, sizeof(float2) * N * ((N/2)+1), cudaHostAllocDefault);
    cudaHostAlloc((void **)&r2, sizeof(float2) * N * ((N/2)+1), cudaHostAllocDefault);
    cufftComplex *d_r;
    cudaMalloc((void **)&d_r, sizeof(float2) * N * ((N/2)+1));
    std::ifstream in_file ("data.bin", std::ios::in | std::ios::binary);
    float *r1_pp = r1;
    float *r2_pp = r2;
    cudaStream_t str;
    cudaStreamCreate(&str);
    cufftHandle plan;
    cufftResult cstat = cufftPlan2d(&plan, N, N, CUFFT_R2C);
    if (cstat != CUFFT_SUCCESS) {std::cout << "CUFFT error1: " << (int)cstat << std::endl; return 0;}
    cstat = cufftSetStream(plan, str);
    if (cstat != CUFFT_SUCCESS) {std::cout << "CUFFT error2: " << (int)cstat << std::endl; return 0;}
    in_file.read((char *)r1, N*N*sizeof(float));
    int count = 0;
    long long dt = dtime_usec(0);
    while (in_file.good()){
#ifndef SKIP_FFT
      cudaMemcpyAsync(d_r, r1_pp, sizeof(float) * N * N, cudaMemcpyHostToDevice, str);
      cstat = cufftExecR2C(plan, (cufftReal *)d_r, d_r);
      if (cstat != CUFFT_SUCCESS) {std::cout << "CUFFT error3: " << (int)cstat << std::endl; return 0;}
      cudaMemcpyAsync(r1_pp, d_r, sizeof(float2) * N * ((N/2)+1), cudaMemcpyDeviceToHost, str);
#endif
      in_file.read((char *)r2_pp, N*N*sizeof(float));
      cudaStreamSynchronize(str);
      count++;
      temp  = r1_pp;
      r1_pp = r2_pp;
      r2_pp = temp;}
    dt = dtime_usec(dt);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {std::cout << "CUDA error: " << cudaGetErrorString(err) << std::endl; return 0;}
    in_file.close();
    std::cout << "FFTs: " << count << std::endl;
    std::cout << "time: " << dt/(float)USECPSEC << "s" << std::endl;
    return 0;
}
$ g++ -I/usr/local/cuda/include t1528.cpp -o t1528 -L/usr/local/cuda/lib64 -lcufft -lcudart
$ ./t1528
FFTs: 32
time: 0.023965s
$ g++ -I/usr/local/cuda/include t1528.cpp -o t1528 -L/usr/local/cuda/lib64 -lcufft -lcudart -DSKIP_FFT
$ ./t1528
FFTs: 32
time: 0.024715s
$

CentOS 7, CUDA 10.1.243, Tesla V100

This assumes the existence of a data.bin file that is of an appropriate size. Mine happens to be ~136MB which supports 32 4MB FFTs.
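
If you want to reproduce this, a dummy input file of suitable size can be created with something like the following (the contents don’t matter for the timing measurement):

$ dd if=/dev/zero of=data.bin bs=1M count=136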

On my system, we see that with or without the GPU activity, the loop processing duration is essentially the same, meaning that the loop is proceeding at the rate at which the file system can supply data. This particular rate appears to be ~128 MB / 0.024 s ≈ 5.3 GB/s.

If you are streaming data from a high performance file system, then it may enter the domain of a “high-ingest-rate problem”. That domain would require more careful application design (and perhaps multiple GPUs along with an appropriate system/PCIE topology) to keep up with the file system.

Sorry for not being clearer in my previous post. I should have pointed out that I intended to achieve two different things:

a) converting my input data from 16-bit integer to float, prior to the FFT, in a more efficient manner than a for loop
b) decoupling the reading operation from the FFT execution

The linked post and the use of a load callback would serve the purpose of a). Additionally, I was also asking how to achieve b), and for that your answer has been very helpful. Just to give you some feedback: an 8 GB file takes about 2.8 s to process on a GTX 970 and about 1.5 s on a GTX 1080, both running on Ubuntu 18.04 with CUDA 10.1. These numbers match the rates you mentioned.
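
For reference, what I had in mind for a) without callbacks would be a small conversion kernel run after copying the raw samples to the device, something along these lines (just a sketch, names made up):

__global__ void u16_to_complex(const uint16_t *in, cufftComplex *out, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i].x = (float)in[i];  // promote the raw sample to single precision
        out[i].y = 0.0f;          // imaginary part is zero
    }
}

// launched e.g. as: u16_to_complex<<<(n + 255)/256, 256>>>(d_raw, d_data, n);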

The bottleneck I am experiencing is actually in saving these results to a file, not in the calculations themselves. To save 16 GB of data, my program takes at least 30 seconds (that’s ~530 MB/s on an SSD, which isn’t bad), and as far as I know there is no way around it: the writing operation is inherently “slow”. Neither 2.8 nor 1.5 s is significant next to 30 s. So for this particular application I don’t see the need for multiple GPUs, as shaving off another second won’t have a significant impact overall.

In any case, I wanted to say thanks for your time, the worked example and for your detailed explanations.

Have you tried a large (>= 1 TB) top-of-the-line NVMe SSD? Supposedly those can sink more than 1GB/sec even for very long write streams.

Not yet; I am looking into different options. We are dealing with larger amounts of data, a couple of hundred TB, and SSDs won’t scale that well. Even if I used a couple of NVMe SSDs in a RAID 0, I would eventually still need to move the files off the SSDs (and back onto the server) to keep processing more data.

Anyhow, this is getting off topic and maybe would need another thread.

Don’t worry too much about wandering off topic a bit. Around here we allow for a fair amount of flexibility, especially once the original thread-starting question has been addressed. On the flip-side, we don’t offer karma points :-)

Your scenario of becoming I/O limited once computational bottlenecks have been eliminated with the help of CUDA/GPUs isn’t that unusual, and I am always interested in hearing about approaches to resolving those kinds of issues. My knowledge doesn’t extend beyond NVMe SSDs in a RAID 0 configuration either.

Aren’t there SSD-based storage appliances? Or at least hybrid ones that include SSDs as a fast cache backed by large-capacity HDDs?

I’m sure there are hybrid solutions, and the SSD cache would improve the I/O, but I think we are also bandwidth limited (and the cache would take some time to fill up).

This is not my area of expertise, but I think a potential solution would be to use several HDDs in a RAID 10 on a 10 GbE network. It’s scalable, has redundancy, and would perform fast I/O at the expense of some storage capacity. It seems like that would bring the I/O speed to the same order of magnitude the GPUs are working at.

10 GigE is only 1.25 GB/s

Modern enterprise storage solutions for high bandwidth generally connect via 100 GigE or similar, possibly even multiple 100 GigE links for higher bandwidth.

NVIDIA has a variety of storage partners that we work with to deliver enterprise-grade, high-bandwidth storage solutions for GPU-enabled processing:

[url]https://www.nvidia.com/en-us/data-center/dgx-pod-reference-architecture/[/url]

That page lists about 5 partners, and although it is in the context of DGX, you don’t need a DGX system to deploy one of their enterprise-grade storage solutions for GPU-accelerated processing. Those storage solutions can deliver data at 5 GB/s and higher.