Multiple batches of 1D FFT using cuFFT

Hi Team,
I’m trying to run parallel 1D FFTs with CUDA 10.1 on an NVIDIA GTX 1050 Ti GPU. I was planning to do this through scikit-cuda’s wrapper around cuFFT. I am able to schedule and run a single 1D FFT with cuFFT, and the output matches NumPy’s FFT output. The moment I launch parallel FFTs by increasing the batch size, the output does NOT match NumPy’s FFT. Could you please help with what could possibly be going wrong here? The code is pasted below; the ‘singleFFT’ param decides whether we schedule a single 1D FFT or many of them.

# -*- coding: utf-8 -*-
"""
Created on Tue Aug 27 19:32:42 2019

@author: Ankit Sharma
"""

import numpy as np
from time import process_time
from skcuda import cufft as cf
import pycuda.autoinit
from pycuda import gpuarray

# params
nSamp = 512
nTx = 16
nRx = 16
nChirp = 256
NX = nChirp
# Uncomment the following line to generate same data always
# np.random.seed(seed=1)
data = (np.random.randn(nSamp,nChirp,nTx,nRx) + 1j*np.random.randn(nSamp,nChirp,nTx,nRx)).astype(np.complex64)
data = data.reshape(nSamp,-1,nTx*nRx)
dataShp0 = np.int32(data.shape[0])
dataShp2 = np.int32(data.shape[2])


singleFFT = 0
if (1 == singleFFT):
    data_t      = data[0,:,0]
    fftAxis = 0
    BATCH = np.int32(1)
else:
    data_t      = data
    fftAxis = 1
    BATCH = np.int32(nSamp*nTx*nRx)
# calculate and time NumPy FFT
t1 = process_time()
dataFft     = np.fft.fft(data_t, axis=fftAxis)
t2 = process_time()
print('\nCPU NumPy time is: ',t2-t1)

# calculate and time GPU FFT
t1 = process_time()
# transfer input data to Device
data_t_gpu  = gpuarray.to_gpu(data_t)
data_o_gpu  = gpuarray.empty_like(data_t_gpu)
# Make FFT plan
plan = cf.cufftPlan1d(NX, cf.CUFFT_C2C, BATCH)
# Execute FFT plan
cf.cufftExecC2C(plan, int(data_t_gpu.gpudata), int(data_o_gpu.gpudata), cf.CUFFT_FORWARD)

dataFft_gpu = data_o_gpu.get()
t2 = process_time()
print('\nGPU time is: ',t2-t1)
print(np.allclose(dataFft,dataFft_gpu,atol=1e-6))

You might want to try using cupy

Could you please elaborate or give a sample that uses CuPy to schedule multiple 1D FFTs and beat NumPy’s FFT by a good margin in processing time? I thought cuFFT and PyCUDA’s FFT were solely meant for this purpose. I would appreciate a small sample on this using scikit-cuda’s cuFFT or PyCUDA’s FFT.

Here’s an example with cupy:

$ cat t18.py
import numpy as np
from time import process_time
import cupy as cp

# params
nSamp = 512
nTx = 16
nRx = 16
nChirp = 256
NX = nChirp
# Uncomment the following line to generate same data always
# np.random.seed(seed=1)
data = (np.random.randn(nSamp,nChirp,nTx,nRx) + 1j*np.random.randn(nSamp,nChirp,nTx,nRx)).astype(np.complex64)
data = data.reshape(nSamp,-1,nTx*nRx)
dataShp0 = np.int32(data.shape[0])
dataShp2 = np.int32(data.shape[2])

singleFFT = 0
if (1 == singleFFT):
    data_t      = data[0,:,0]
    fftAxis = 0
    BATCH = np.int32(1)
else:
    data_t      = data
    fftAxis = 1
    BATCH = np.int32(nSamp*nTx*nRx)
# calculate and time NumPy FFT
dataFft_o     = np.fft.fft(data_t, axis=fftAxis)  # warm-up call (not timed)
t1 = process_time()
dataFft     = np.fft.fft(data_t, axis=fftAxis)
t2 = process_time()
print('\nCPU NumPy time is: ',t2-t1)

# calculate and time GPU FFT
t1 = process_time()
# transfer input data to Device
data_t_gpu  = cp.array(data_t)
data_o_gpu  = cp.empty_like(data_t_gpu)
data_o1_gpu  = cp.fft.fft(data_t_gpu, axis=fftAxis)
cp.cuda.Device().synchronize()
t3 = process_time()
data_o_gpu  = cp.fft.fft(data_t_gpu, axis=fftAxis)
cp.cuda.Device().synchronize()
t4 = process_time()
dataFft_gpu = cp.asnumpy(data_o_gpu)
t2 = process_time()
print('\nGPU time is: ',t2-t1)
print('\nGPU compute time is: ',t4-t3)
print(np.allclose(dataFft,dataFft_gpu,atol=1e-5))
$ python t18.py

CPU NumPy time is:  0.4426147770000002

GPU time is:  1.8312730369999999

GPU compute time is:  0.016624007999999968
True
$

When we compare just the computation time, after a warm-up, the GPU is about 443/17, i.e. roughly 26x, faster. Obviously this speedup factor will vary depending on which CPU and GPU you are comparing. This particular GPU happens to be a GTX 960.

Thanks Robert, this helps, though I have some specific questions:

  • Is there an NVIDIA-provided example that does this same thing using either scikit-cuda's cuFFT or PyCUDA's FFT? That would really help.
  • On my machine the speed-up is about 15x, but the outputs match only if I relax the tolerance to 1e-4. What could be the reason?
  • Does NumPy's FFT inherently use all the cores available on the host to speed up FFT execution?
  • I normally find the computation time is extremely small compared to the host-device data transfer time when using CUDA functions. In my case, fetching the data from the device takes almost 160 ms. Is this typically the case? What could I do better to avoid this huge latency?

Thanks again.

Also, do we really need to call cuda.Device().synchronize() here? Does CuPy inherently launch multiple streams? Please help explain this.

NVIDIA doesn’t develop or maintain scikit-cuda or PyCUDA. I don’t think you’ll find any NVIDIA sample codes for anything having to do with those libraries. NVIDIA also doesn’t maintain Numba, and formally doesn’t maintain CuPy.

Regarding the speed, your GPU might be slower, and/or your CPU might be faster. Regarding the tolerance, in my experience, complex parallelized floating point computations rarely match identically between GPU and CPU. There are a number of reasons for this, and the internet is full of such questions. In general, when comparing two floating point numbers, one must be aware of the range (magnitude) of the numbers and take that into account. The calculations here are in single precision, so they have at most about 6 accurate digits. A fixed tolerance of 1e-4 doesn’t take that range into account, so it’s not a very useful method for comparison. Anyway, this is a lengthy topic and I’m not going to write a treatise here; research will serve you well. Here’s an NVIDIA whitepaper that may be of interest:

https://docs.nvidia.com/cuda/floating-point/index.html

It mostly explains how numerical differences can arise between CPU and GPU results. There are many analyses on the web which, when done carefully, show the GPU producing more accurate results than the CPU. That may or may not be the case here.
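
For illustration only, here is a minimal sketch of a magnitude-aware comparison for the script above, using NumPy's relative tolerance rather than a fixed absolute tolerance (the rtol value is an assumption suited to single precision, not a recommendation from this thread):

# Compare with a relative tolerance on the order of single-precision accuracy,
# plus a small absolute floor for bins whose magnitude is near zero.
scale = np.abs(dataFft).max()
print(np.allclose(dataFft, dataFft_gpu, rtol=1e-5, atol=1e-7*scale))
# Or inspect the worst relative error directly:
rel_err = np.abs(dataFft - dataFft_gpu) / (np.abs(dataFft) + 1e-30)
print('max relative error:', rel_err.max())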

I don’t know.

In my experience, the necessity to transfer the data to the device can’t be avoided, and there is usually little that can be done to optimize it. If there is anything to be done to optimize it, it usually requires profiling and careful analysis to figure that out. I’m not going to do that for you. For example, on the “careful analysis” side, I would start by figuring out how much data (bytes) is actually being transferred, and see if the transfer time is consistent with that. Next would be to discover whether that size in bytes makes sense for the problem at hand. The profilers can help with some of this kind of verification.

In these situations, the usual suggestion is to acknowledge the cost of data transfer, but make it irrelevant in the overall scheme. What this means is that if you intend to do one single thing on the GPU, that is very often not a good application design. If the only thing you intend to do is this single FFT operation, and everything else will be conducted on the CPU, that’s generally not a very good use case for GPU computing. Instead, transfer as much of the workflow to the GPU as possible. If it costs 160 ms to transfer the data to the GPU, then do more work on the GPU with that data, so that the 160 ms cost is trivial. Can’t figure out a way to do that? Then your application design may not be a very good fit for the GPU.
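
As a rough sanity check of the kind described above (a sketch only; data_t is the array from the earlier script, and the PCIe rates are assumed ballpark figures, not measurements):

# How many bytes actually cross the PCIe bus, and roughly what transfer time that implies.
nbytes = data_t.nbytes                       # 512*256*256 complex64 samples ≈ 268 MB
for bw in (3e9, 6e9):                        # assumed pageable vs. pinned-ish transfer rates, in bytes/s
    print('%d bytes, ~%.0f ms at %.0f GB/s' % (nbytes, 1e3*nbytes/bw, bw/1e9))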

The synchronize usage here has nothing to do with streams, but has to do with the asynchronous nature of CPU and GPU activity. It is standard practice when doing a careful, trustworthy job of timing individual GPU operations. Again, there are many questions about this on the web, and some research here will fill in your knowledge gaps.
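
If it helps, an alternative to host-side timers plus synchronize() is to time the operation with CUDA events. Here is a minimal sketch using CuPy's event API, applied to the cp.fft.fft call from the script above (an illustration, not something discussed elsewhere in this thread):

start = cp.cuda.Event()
stop = cp.cuda.Event()
start.record()                                  # enqueue a start marker on the current stream
data_o_gpu = cp.fft.fft(data_t_gpu, axis=fftAxis)
stop.record()                                   # enqueue a stop marker after the FFT
stop.synchronize()                              # block the host until the GPU reaches the stop marker
print('FFT time: %.3f ms' % cp.cuda.get_elapsed_time(start, stop))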

I am dealing with the same problem. I have a binary file, say 16 GB, that stores many replicas of a signal (let’s say my signal consists of 25000 integers). To optimize/parallelize performing as many 1D FFTs as I have replicas, I use batched 1D cuFFT transforms. I took this code as a starting point: “1D batched FFTs of real arrays” on Stack Overflow.

To minimize the number of memory transfers, I calculate the maximum batch size that will fit on my GPU based on its memory size. I then rely on a for loop that performs however many batched FFT runs are needed to cover the entire 16 GB. I realized that my outputs were all zero if (batch_size * signal_length) exceeded the maximum memory pitch. If I reduce the batch size so that the total size is smaller than the maximum memory pitch, then it all works fine.

My main issue is that I need to halve the batch size, which doubles the processing time. This also limits the amount of GPU memory I can use to 4 GB out of 8 GB. Is there an alternative, or am I approaching this problem totally wrong? I have been looking for a solution, but most of what I find is for 2D or 3D FFTs; this thread is the closest I have seen so far.

Any help will be appreciated. Thanks!

@gemas135, maybe a code snippet could help us understand exactly what you are doing. My problem was resolved by using CuPy and by getting the memory alignment of the FFT dimension right, since my development is in Python. For C, I hope you are using cufftPlanMany() rather than cufftPlan1d().
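
For reference, the memory-alignment point boils down to making the FFT axis the fastest-varying (contiguous) one before planning, since cufftPlan1d() expects the BATCH transforms to sit back to back in memory. Below is a minimal sketch against my original script; the transpose/ascontiguousarray step is the only change, variable names follow the code posted above, and this is an untested illustration rather than a drop-in fix:

# Move the chirp axis last so each length-NX signal is contiguous in memory;
# the batched 1D plan then sees BATCH back-to-back transforms.
data_c = np.ascontiguousarray(np.transpose(data, (0, 2, 1)))   # (nSamp, nTx*nRx, nChirp)
data_t_gpu = gpuarray.to_gpu(data_c)
data_o_gpu = gpuarray.empty_like(data_t_gpu)
plan = cf.cufftPlan1d(NX, cf.CUFFT_C2C, np.int32(nSamp*nTx*nRx))
cf.cufftExecC2C(plan, int(data_t_gpu.gpudata), int(data_o_gpu.gpudata), cf.CUFFT_FORWARD)
cf.cufftDestroy(plan)
# Undo the transpose before comparing against np.fft.fft(data, axis=1).
dataFft_gpu = np.transpose(data_o_gpu.get(), (0, 2, 1))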

Ankits, like I tried to explain, my data size exceeds what can fit in GPU memory, so what I’m doing instead is multiple “runs” of batched 1D FFTs. Below is an implementation:

#include <cuda.h>
#include <cufft.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>

#define DATASIZE 25000   // samples per signal replica

/********************/
/* CUDA ERROR CHECK */
/********************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

/********/
/* MAIN */
/********/

int main(){

    FILE *fid_in, *fid_out;
    cudaDeviceProp prop;
    int ndev;
    const char *in_filename  = "input.bin";    // placeholder path for the raw uint16 input
    const char *out_filename = "output.bin";   // placeholder path for the complex output
    cufftReal *hostInputData; 
    cufftComplex *hostOutputData;

    //Query device properties
    cudaGetDeviceCount(&ndev);
    cudaGetDeviceProperties(&prop,0);

    // Open input and output files
    fid_in = fopen(in_filename, "rb");
    fid_out = fopen(out_filename, "wb");

    // Calculate size file
    fseek(fid_in, 0L, SEEK_END); // move until end of file
    long sz = ftell(fid_in); // file size
    fseek(fid_in, 0L, SEEK_SET); //rewind for future readings

    // Calculate max batch size
    int raw_precision = 2; // bytes
    int fft_precision = 4; // input is cufftReal (float 32 bit) output is cufftComplex (2 * float 32 bit) but half the samples

    int max_elem = prop.totalGlobalMem/2/fft_precision; //divide by 2 to account for input and output arrays, then over the precision to get the max number of elements
    int BATCH_MAX = floor((max_elem/DATASIZE)*0.47); // If I set anything above 50%, I get a memory allocation error and the fft output is all 0.

    int nruns = ceil((double) sz/2/(BATCH_MAX*DATASIZE)); // number of times the max batch is needed to cover entire input data
    int BATCH = BATCH_MAX; // per-run batch count; reduced on the final (possibly partial) run

    // Raw data memory allocation
    uint16_t *raw_data = (uint16_t *)malloc(sizeof(uint16_t)*BATCH_MAX*DATASIZE);

    // --- Host side input data allocation and initialization -> Convert from unsigned 16-bit integer to 32-bit float
    gpuErrchk(cudaMallocHost((void**)&hostInputData, DATASIZE * BATCH_MAX * sizeof(cufftReal)));

    // --- Device side input data allocation and initialization
    cufftReal *deviceInputData; 
    gpuErrchk(cudaMalloc((void**)&deviceInputData, DATASIZE * BATCH_MAX * sizeof(cufftReal)));

    // --- Host side output data allocation
    gpuErrchk(cudaMallocHost((void**)&hostOutputData, (DATASIZE/2 + 1) * BATCH_MAX * sizeof(cufftComplex)));

    // --- Device side output data allocation
    cufftComplex *deviceOutputData; 
    gpuErrchk(cudaMalloc((void**)&deviceOutputData, (DATASIZE / 2 + 1) * BATCH_MAX * sizeof(cufftComplex)));

    // --- Batched 1D FFTs
    cufftHandle handle;
    int rank = 1;                           // --- 1D FFTs
    int n[] = { DATASIZE };                 // --- Size of the Fourier transform
    int istride = 1, ostride = 1;           // --- Distance between two successive input/output elements
    int idist = DATASIZE, odist = (DATASIZE / 2 + 1); // --- Distance between batches
    int inembed[] = { 0 };                  // --- Input size with pitch (ignored for 1D transforms)
    int onembed[] = { 0 };                  // --- Output size with pitch (ignored for 1D transforms)
                                  // --- Number of batched executions
    cufftPlanMany(&handle, rank, n, 
                  inembed, istride, idist,
                  onembed, ostride, odist, CUFFT_R2C, BATCH_MAX);

    // Start loop of 1d batched ffts
    for (int irun = 0; irun < nruns; irun++){

        if(irun==nruns-1){
            // adjust the batch size on the last run, as it may be smaller
            long sz_current = ftell(fid_in);
            BATCH = (sz-sz_current)/2/DATASIZE;

            cufftDestroy(handle); // release the full-size plan before re-planning
            cufftPlanMany(&handle, rank, n, 
                  inembed, istride, idist,
                  onembed, ostride, odist, CUFFT_R2C, BATCH);

        }

        //read my data
        fread(raw_data, sizeof(uint16_t), DATASIZE*BATCH, fid_in);

        // Convert from 16uint to single-precision float
        for (int i=0; i<BATCH; i++)
            for (int j=0; j<DATASIZE; j++) 
                hostInputData[i*DATASIZE + j] = (cufftReal) raw_data[i*DATASIZE + j];
        
        // ---- Host->Device copy the input
        gpuErrchk(cudaMemcpy(deviceInputData, hostInputData, DATASIZE * BATCH * sizeof(cufftReal), cudaMemcpyHostToDevice)); // copy only the batches read in this run

        // ---- Calculate fft
        cufftExecR2C(handle,  deviceInputData, deviceOutputData);

        // --- Device->Host copy of the results
        gpuErrchk(cudaMemcpy(hostOutputData, deviceOutputData, (DATASIZE / 2 + 1) * BATCH * sizeof(cufftComplex), cudaMemcpyDeviceToHost)); // copy only the batches computed in this run

        // Write results to file
        fwrite(hostOutputData, sizeof(cufftComplex), (DATASIZE / 2 + 1) * BATCH, fid_out);

    }

    cufftDestroy(handle);
    gpuErrchk(cudaFree(deviceOutputData));
    gpuErrchk(cudaFree(deviceInputData));
    gpuErrchk(cudaFreeHost(hostOutputData));
    gpuErrchk(cudaFreeHost(hostInputData));
    free(raw_data);

    fclose(fid_in);
    fclose(fid_out);
    return 0;

}

I currently have it hard-coded to use 47% of the GPU memory; if I go over 50%, I get a GPU memory allocation error even though I am not using all of the GPU memory, and cuFFT returns zeros. The problem is that I can’t use all or most of the memory, and I am not sure why. My guess is that BATCH_MAX*DATASIZE exceeds the grid size limit along x? If that is the case, what is an alternative approach to this problem in order to fully utilize the resources available?

Sorry for missing this. I hope this is resolved by now? In case it isn’t, could you try the sample codes that ship with the CUDA Toolkit and shuffle the FFT sizes around there? Do you observe a similar memory dependency? Though I use Python’s cuFFT, which is almost a replica of the C-based cuFFT, I don’t experience such memory dependencies.
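
For what it is worth, the chunked approach translates fairly directly to CuPy. Here is a rough sketch assuming a raw uint16 file of 25000-sample records as described earlier; the file names, chunk size, and the choice of rfft are illustrative assumptions, not something validated against your data:

import numpy as np
import cupy as cp

DATASIZE = 25000                   # samples per record (from the earlier post)
CHUNK = 4096                       # records per GPU batch; tune to the available memory
raw = np.memmap('signal_u16.bin', dtype=np.uint16, mode='r')   # placeholder input name
nrec = raw.size // DATASIZE
raw = raw[:nrec*DATASIZE].reshape(nrec, DATASIZE)

with open('spectra_c64.bin', 'wb') as fout:                    # placeholder output name
    for i in range(0, nrec, CHUNK):
        block = cp.asarray(raw[i:i+CHUNK], dtype=cp.float32)   # H2D copy plus cast
        spec = cp.fft.rfft(block, axis=1)                      # batched real-to-complex 1D FFTs
        cp.asnumpy(spec).astype(np.complex64).tofile(fout)     # D2H copy, append to output file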