Batch limit of 1D cufft

viterbi · October 10, 2009, 4:00am

Hello, I would like to use CUDA to do many (say 200,000) 1-D FFT of size 256 simultaneously. However, there seems to be a limit of the batch number, which is 65,535, if the batch number is larger than this number, then results are completely wrong. Anyone know what’s going on?

Here are the configuration of my video card and my code:

[codebox]

Device 0: “Tesla C1060”

CUDA Driver Version: 2.30

CUDA Runtime Version: 2.30

CUDA Capability Major revision number: 1

CUDA Capability Minor revision number: 3

Total amount of global memory: 4294705152 bytes

Number of multiprocessors: 30

Number of cores: 240

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 16384 bytes

Total number of registers available per block: 16384

Warp size: 32

Maximum number of threads per block: 512

Maximum sizes of each dimension of a block: 512 x 512 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

Maximum memory pitch: 262144 bytes

Texture alignment: 256 bytes

Clock rate: 1.30 GHz

Concurrent copy and execution: Yes

Run time limit on kernels: No

Integrated: No

Support host page-locked memory mapping: Yes

Compute mode: Default (multiple host threads can use this device simultaneously)

[/codebox]

[codebox]

// includes, system

#include <stdlib.h>

#include <stdio.h>

#include <string.h>

#include <math.h>

// includes, project

#include <cufft.h>

#include <cutil_inline.h>

////////////////////////////////////////////////////////////////////////////////

// declaration, forward

void runTest(int argc, char** argv);

#define SIGNAL_SIZE 256

#define N_SIGNAL 65536

////////////////////////////////////////////////////////////////////////////////

// Program main

////////////////////////////////////////////////////////////////////////////////

int main(int argc, char** argv)

{

runTest(argc, argv);

cutilExit(argc, argv);

}

////////////////////////////////////////////////////////////////////////////////

//! Run a simple test for CUDA

////////////////////////////////////////////////////////////////////////////////

void runTest(int argc, char** argv)

{

if( cutCheckCmdLineFlag(argc, (const char**)argv, "device") )

	cutilDeviceInit(argc, argv);

else

	cudaSetDevice( cutGetMaxGflopsDeviceId() );

// Allocate host memory for the signal

float* h_signal = (float *)malloc(sizeof(float)*(SIGNAL_SIZE+2)*N_SIGNAL);

// Initalize the memory for the signal

for (unsigned int i = 0; i < N_SIGNAL; i++) {

float fl = 0.1+0.3*i/float(N_SIGNAL);

for (unsigned int j = 0; j < SIGNAL_SIZE+2; j++) {

    	h_signal[i*(SIGNAL_SIZE+2)+j] = sin(2*3.1416*fl*j);

}

}

// Allocate device memory for signal

float* d_signal;

cutilSafeCall(cudaMalloc((void**)&d_signal, sizeof(float)*(SIGNAL_SIZE+2)*N_SIGNAL));

if ( cudaSuccess == cudaGetLastError() ) {  printf("Success!\n"); }

else { printf("Error %d\n", cudaGetLastError()); }

// Copy host memory to device

cutilSafeCall(cudaMemcpy(d_signal, h_signal, sizeof(float)*(SIGNAL_SIZE+2)*N_SIGNAL, cudaMemcpyHostToDevice));

if ( cudaSuccess == cudaGetLastError() ) {  printf("Success!\n"); }

else { printf("Error %d\n", cudaGetLastError()); }

// CUFFT plan

cufftHandle plan;

cufftPlan1d(&plan, SIGNAL_SIZE, CUFFT_R2C, N_SIGNAL);

if ( cudaSuccess == cudaGetLastError() ) {  printf("Success!\n"); }

else { printf("Error %d\n", cudaGetLastError()); }

// FFT

cufftExecR2C(plan, (cufftReal *)d_signal, (cufftComplex *)d_signal);

printf("Error: %s\n", cudaGetErrorString(cudaGetLastError()));

// Copy device memory to host

cutilSafeCall(cudaMemcpy(h_signal, d_signal, sizeof(float)*(SIGNAL_SIZE+2)*N_SIGNAL, cudaMemcpyDeviceToHost));

if ( cudaSuccess == cudaGetLastError() ) {  printf("Success!\n"); }

else { printf("Error %d\n", cudaGetLastError()); }

for (unsigned int i = 0; i < N_SIGNAL; i++) {

float fl = 0.1+0.3*i/float(N_SIGNAL);

for (unsigned int j = 0; j < (SIGNAL_SIZE+2)/2; j++) {

	printf("i=%5d   fl=%10.4f   j=%5d    f=%10.4f   h[i]=%10.4f %10.4f\n", i, fl, j, j/float(SIGNAL_SIZE), h_signal[i*(SIGNAL_SIZE+2)+j*2], h_signal[i*(SIGNAL_SIZE+2)+j*2+1]);

    }

printf("\n");

}

//Destroy CUFFT context

cufftSafeCall(cufftDestroy(plan));

// cleanup memory

free(h_signal);

cutilSafeCall(cudaFree(d_signal));

cudaThreadExit();

}

[/codebox]

LSChien · October 10, 2009, 4:32am

I suffer the same problem, in CUFFT_Library_2.3.pdf, it says

“1D transform sizes up to 8 million elements”

if you use batch = 65535, then it exceeds 8 million elements.

for your application, batch = 200,000, divide 200,000 = (16K)*12 + 3392

and use two planes, one is for batch = 16K, the other is for batch = 3392

Topic		Replies	Views
CUFFT Issues CUDA Programming and Performance	2	835	February 4, 2011
CUFFT issue CUDA Programming and Performance	0	1105	December 29, 2009
How to decide maximum size of the data using 1D CUFFT CUDA Programming and Performance	1	1779	September 19, 2011
cufft maximum 1d size with batch cuda error doing very large 1dffts CUDA Programming and Performance	0	3653	April 8, 2010
Memory needed for CUDA 1D FFT plan creation - or how to make saussage with CUDA hacks CUDA Programming and Performance	8	12241	February 9, 2011
Multiple batches of 1D FFT using cuFFT GPU-Accelerated Libraries	10	4882	October 29, 2019
CUFFT array size CUDA Programming and Performance	1	720	October 17, 2013
CUDA 2.2 and failing CUFFT SDK example CUDA Programming and Performance	3	10569	June 9, 2009
size limit of 1D FFT CUDA Programming and Performance	8	2524	September 24, 2011
cufft: ERROR: CUFFT_INVALID_PLAN CUDA Programming and Performance	2	2260	June 2, 2007

Batch limit of 1D cufft

Related topics