I have very large 2D arrays (occupying over 60 GB on disk) on which I have to perform 1D FFTs column by column, and I have at my disposal up to 8 GPUs connected via PCIe. The transform size is small (although not a power of 2); I just have to compute a lot of them.
I have tested a single-GPU program in which I sequentially read a chunk of data, copy it host-to-device, execute a batched 1D FFT, copy the results device-to-host, and write them to disk, iterating until the whole file is covered. No issues there and the results are correct. Now I would like to exploit all the GPU resources I have in order to speed up these calculations, so I started looking at different options.
I took the example provided in the cuFFT documentation for computing 1D FFTs on multiple GPUs, slightly modified so that it compiles.
// Demonstrate how to use CUFFT to perform 1-d FFTs using 2 GPUs
// Output on the GPUs is in natural order
// Function return codes should be checked for errors in actual code
//
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
// CUDA runtime
#include <cuda_runtime.h>
//CUFFT Header file
#include <cufftXt.h>
int main(){
// cufftCreate() - Create an empty plan
cufftHandle plan_input; cufftResult result;
result = cufftCreate(&plan_input);
//
// cufftXtSetGPUs() - Define which GPUs to use
int nGPUs = 2, whichGPUs[2];
whichGPUs[0] = 0; whichGPUs[1] = 1;
result = cufftXtSetGPUs (plan_input, nGPUs, whichGPUs);
//
// Initialize FFT input data
size_t worksize[2];
cufftComplex *host_data_input, *host_data_output;
int nx = 16384, batch = 2;
size_t size_of_data = sizeof(cufftComplex) * nx * batch;
cudaMallocHost((void**)&host_data_input, size_of_data);
cudaMallocHost((void**)&host_data_output, size_of_data);
int rank = 1; // --- 1D FFTs
int n[] = { nx }; // --- Size of the Fourier transform
int istride = 1, ostride = 1; // --- Distance between two successive input/output elements
int idist = nx, odist = (nx); // --- Distance between batches
int inembed[] = { 0 }; // --- Input size with pitch (ignored for 1D transforms)
int onembed[] = { 0 }; // --- Output size with pitch (ignored for 1D transforms)
for (int i = 0; i< nx*batch; i++){
host_data_input[i].x = i*i;
host_data_input[i].y = i;
//printf("input host %f \n", host_data_input[i].x);
}
result = cufftGetSizeMany(plan_input, rank, n, inembed, istride, idist,
onembed, ostride, odist, CUFFT_C2C, batch, worksize);
printf("Work size area is %zu and %zu\n", worksize[0], worksize[1]);
// cufftMakePlanMany() - Create the plan
result = cufftMakePlanMany (plan_input, rank, n, inembed, istride, idist,
onembed, ostride, odist, CUFFT_C2C, batch, worksize);
//
// cufftXtMalloc() - Malloc data on multiple GPUs
cudaLibXtDesc *device_data_input, *device_data_output;
result = cufftXtMalloc (plan_input, &device_data_input,
CUFFT_XT_FORMAT_INPLACE);
result = cufftXtMalloc (plan_input, &device_data_output,
CUFFT_XT_FORMAT_INPLACE);
//
// cufftXtMemcpy() - Copy data from host to multiple GPUs
result = cufftXtMemcpy (plan_input, device_data_input,
host_data_input, CUFFT_COPY_HOST_TO_DEVICE);
//
// cufftXtExecDescriptorC2C() - Execute FFT on multiple GPUs
result = cufftXtExecDescriptorC2C (plan_input, device_data_input,
device_data_input, CUFFT_FORWARD);
//
// cufftXtMemcpy() - Copy the data to natural order on GPUs
result = cufftXtMemcpy (plan_input, device_data_output,
device_data_input, CUFFT_COPY_DEVICE_TO_DEVICE);
//
// cufftXtMemcpy() - Copy natural order data from multiple GPUs to host
result = cufftXtMemcpy (plan_input, host_data_output,
device_data_output, CUFFT_COPY_DEVICE_TO_HOST);
//
// Print output and check results
for (int i = 0; i<100; i++){
printf("host output data i: %d, real %f, imag %f \n", i, host_data_output[i].x, host_data_output[i].y);
}
//
// cufftXtFree() - Free GPU memory
result = cufftXtFree(device_data_input);
result = cufftXtFree(device_data_output);
//
// cufftDestroy() - Destroy FFT plan
result = cufftDestroy(plan_input);
//
// cudaFreeHost() - Free pinned host buffers
cudaFreeHost(host_data_input);
cudaFreeHost(host_data_output);
return 0;
}
I notice the following:
a) If I increase the batch size to anything greater than 1, the output is all zeros.
b) With batch = 1, if the transform size is not a power of 2, the output is also all zeros.
Is there something wrong in my code, or does cufftXt simply not support FFT batch sizes larger than 1? I have run cuda-memcheck every time and it reports no errors, even though the output is zero. Will there be any speed-up from doing batched 1D FFTs across multiple GPUs, or is it not worth exploring?
Thanks!