cufftComplex memory allocation very high

Eluri · December 8, 2009, 12:38pm

Hello all,

In my program I need to transfer an array of cufftComplex values to the GPU before FFT’ing them there. I use the cuFFT library, exactly as in the examples and in the SDK. I allocate GPU memory of the appropriate size and time how long this allocation takes, like this:

	#define NX      1024
	#define BATCH   1024

	cufftComplex *devPtr;
	unsigned int timer = 0;

	cutilCheckError( cutCreateTimer( &timer));
	cutilCheckError( cutStartTimer( timer));

	// GPU memory allocation
	cudaMalloc((void**)&devPtr, sizeof(cufftComplex)*NX*BATCH);

	cutilCheckError( cutStopTimer( timer));
	printf("Processing time: %f (ms)\n", cutGetTimerValue( timer));
	cutilCheckError( cutDeleteTimer( timer));

The timer reports about 80 ms! That is way longer than I expected and makes my plan to do 1024 parallel FFTs of size 1024 each on the GPU impossible.

The rest of my program (including memory transfers and such) takes only 10 ms altogether.

Am I missing something here or is this a normal value for such an allocation?

Thanks

eelsen · December 8, 2009, 6:45pm

What happens when you change cufftComplex to float2?

Eluri · December 8, 2009, 6:53pm

I tried that, just like in the simpleFFT example where they define a float2 Complex. It stays the same. It is the first operation I do in my program (I’m creating a DLL, I don’t know if this matters), so I thought maybe it had something to do with warm-up. I did exactly the same allocation again immediately after this one and timed it. Now it takes well under 1 ms. Very weird.

eelsen · December 8, 2009, 7:03pm

The first time a CUDA function gets called in the runtime API (here cudaMalloc), some initialization happens. I don’t know exactly what the initialization does, but I didn’t think it took that long!
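A common way to keep this one-time cost out of a measurement is to force the initialization before the timer starts. The sketch below assumes that a harmless call such as `cudaFree(0)` is enough to trigger it (a widely used idiom, not something confirmed in this thread) and uses `std::chrono` instead of the SDK's cutil timers. `bufferBytes` and `timedMalloc` are hypothetical helper names.

```cuda
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>   // only needed for the cufftComplex type

// Hypothetical helper: bytes needed for `batch` signals of `nx` complex samples.
static size_t bufferBytes(size_t nx, size_t batch)
{
    return sizeof(cufftComplex) * nx * batch;
}

// Hypothetical helper: wall-clock duration of one cudaMalloc call in ms,
// or a negative value if the allocation fails.
static double timedMalloc(void **ptr, size_t bytes)
{
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaMalloc(ptr, bytes);
    auto t1 = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return -1.0;
    }
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    // Assumption: cudaFree(0) absorbs the one-time runtime initialization,
    // so the allocation below is timed without it.
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaFree(0);
    auto t1 = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Initialization: %f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());

    cufftComplex *devPtr = NULL;
    double ms = timedMalloc((void **)&devPtr, bufferBytes(1024, 1024));
    printf("cudaMalloc after warm-up: %f ms\n", ms);

    cudaFree(devPtr);
    return 0;
}
```

In a DLL, the warm-up call would typically go into whatever one-time setup routine the library already exposes, so that callers do not pay for it inside a timed section.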

Eluri · December 8, 2009, 7:11pm

Indeed, I have never seen anything like this. It shouldn’t take this long, yet it does. I was hoping I had done something obviously wrong that you CUDA gurus could solve :-)

Eluri · December 9, 2009, 11:54am

I have added the whole code below. I still can’t figure out what’s wrong.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cutil.h>
#include <string.h>
#include <math.h>
#include <cutil_inline.h>
#include <cufft.h>
#include <cuda.h>

#define BLOCK_W  16
#define BLOCK_H  32
#define iDivUp(a,b) (((int)(a) % (int)(b) != 0) ? (((int)(a) / (int)(b)) + 1) : ((int)(a) / (int)(b)))

#if __DEVICE_EMULATION__
bool InitCUDA(void) { return true; }
#else
// Select the first device with compute capability 1.0 or higher.
bool InitCUDA(void)
{
	int count = 0;
	int i = 0;

	cudaGetDeviceCount(&count);
	if(count == 0) {
		fprintf(stderr, "There is no device.\n");
		return false;
	}

	for(i = 0; i < count; i++) {
		cudaDeviceProp prop;
		if(cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
			if(prop.major >= 1) {
				break;
			}
		}
	}
	if(i == count) {
		fprintf(stderr, "There is no device supporting CUDA.\n");
		return false;
	}

	cudaSetDevice(i);
	printf("CUDA initialized.\n");
	return true;
}
#endif

/************************************************************************/
/* Example                                                              */
/************************************************************************/
int main(int argc, char** argv)
{
#define NX      1024
#define BATCH   1024

	unsigned int size = sizeof(cufftComplex)*NX*BATCH;
	printf("Memory size total cufftComplex array: %f bytes\n", (float)size);

	// Attempt 0: allocate a single float (the first CUDA call in the program)
	unsigned int timer0 = 0;
	cutilCheckError( cutCreateTimer( &timer0));
	cutilCheckError( cutStartTimer( timer0));
	float *sam;
	cudaMalloc((void**)&sam, sizeof(float));
	cutilCheckError( cutStopTimer( timer0));
	printf("Processing time GPU float allocation attempt 0: %f (ms)\n", cutGetTimerValue( timer0));
	cutilCheckError( cutDeleteTimer( timer0));

	// Attempt 1: allocate the full cufftComplex array
	cufftComplex *devPtr1;
	unsigned int timer1 = 0;
	cutilCheckError( cutCreateTimer( &timer1));
	cutilCheckError( cutStartTimer( timer1));
	cudaMalloc((void**)&devPtr1, size);
	cutilCheckError( cutStopTimer( timer1));
	printf("Processing time GPU cufftComplex allocation attempt 1: %f (ms)\n", cutGetTimerValue( timer1));
	cutilCheckError( cutDeleteTimer( timer1));

	// Attempt 2: same allocation again
	cufftComplex *devPtr2;
	unsigned int timer2 = 0;
	cutilCheckError( cutCreateTimer( &timer2));
	cutilCheckError( cutStartTimer( timer2));
	cudaMalloc((void**)&devPtr2, size);
	cutilCheckError( cutStopTimer( timer2));
	printf("Processing time GPU cufftComplex allocation attempt 2: %f (ms)\n", cutGetTimerValue( timer2));
	cutilCheckError( cutDeleteTimer( timer2));

	// Attempt 3: same allocation again
	cufftComplex *devPtr3;
	unsigned int timer3 = 0;
	cutilCheckError( cutCreateTimer( &timer3));
	cutilCheckError( cutStartTimer( timer3));
	cudaMalloc((void**)&devPtr3, size);
	cutilCheckError( cutStopTimer( timer3));
	printf("Processing time GPU cufftComplex allocation attempt 3: %f (ms)\n", cutGetTimerValue( timer3));
	cutilCheckError( cutDeleteTimer( timer3));

	// Release the device buffers
	cudaFree(sam);
	cudaFree(devPtr1);
	cudaFree(devPtr2);
	cudaFree(devPtr3);
	return 0;
}

with results:

Memory size total cufftComplex array: 8388608.000000 bytes

Processing time GPU float allocation attempt 0: 46.858471 (ms)

Processing time GPU cufftComplex allocation attempt 1: 0.125714 (ms)

Processing time GPU cufftComplex allocation attempt 2: 0.107276 (ms)

Processing time GPU cufftComplex allocation attempt 3: 0.105879 (ms)

Press any key to continue . . .

So now it is just the first memory allocation that is slow (this time for a single float), not just a cufftComplex allocation…

Tigga · December 9, 2009, 12:36pm

As was previously mentioned, and as you have found, the first call has overhead.

Why do you have an InitCUDA function that is never called?

Eluri · December 9, 2009, 12:42pm

Yes indeed, I do use it in my program. I must have deleted it to see what happened and then copy-pasted the code without it.

In my code,

bool InitCUDA(void);

gets called at the beginning of main.

Do you mean that the first call is the first memory allocation, or the first CUDA function (like, for example, my InitCUDA)? Because if it is the latter, adding this InitCUDA function doesn’t change anything about the time the first cudaMalloc takes…
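Which call ends up paying the initialization cost can depend on the CUDA version, so one way to settle the question for a given setup is to time each call separately. The following is a minimal sketch under that assumption; `timeMs` is a hypothetical helper, device 0 is assumed to be the target device, and `std::chrono` stands in for the cutil timers.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: run `call` once, print its wall-clock time and the
// CUDA status it returned, and hand the time back in milliseconds.
template <typename F>
static double timeMs(const char *label, F call)
{
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = call();
    auto t1 = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("%-20s %10.3f ms  (%s)\n", label, ms, cudaGetErrorString(err));
    return ms;
}

int main()
{
    int count = 0;
    float *a = NULL;
    float *b = NULL;

    // The same kinds of calls as in InitCUDA, followed by two allocations.
    timeMs("cudaGetDeviceCount", [&] { return cudaGetDeviceCount(&count); });
    timeMs("cudaSetDevice(0)",   [&] { return cudaSetDevice(0); });
    timeMs("first cudaMalloc",   [&] { return cudaMalloc((void **)&a, sizeof(float)); });
    timeMs("second cudaMalloc",  [&] { return cudaMalloc((void **)&b, sizeof(float)); });

    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

If the large number shows up next to the first cudaMalloc rather than next to the device-selection calls, that would match the observation above that calling InitCUDA first makes no difference.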

ONeill · December 15, 2009, 9:27am


As I have seen in my own programs, the first cudaMemcpy takes a lot of time. I don’t think that those 20 ms are that unusual. The numbers I have seen were quite similar, so if your code runs fine and you get the right results and correct timings every time you run it, I suppose there is no bug behind the extra time for your memcpy.
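If the slow first call is a one-time effect, as the measurements in this thread suggest, it can be paid once at startup and the buffer and plan reused for every batch. Below is a minimal sketch, assuming an in-place single-precision complex-to-complex transform and CUDA events for timing; the iteration count and variable names are illustrative only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

#define NX    1024
#define BATCH 1024

int main()
{
    cufftComplex *data = NULL;
    cufftHandle plan;

    // One-time setup: runtime initialization, allocation and plan creation.
    if (cudaMalloc((void **)&data, sizeof(cufftComplex) * NX * BATCH) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(data, 0, sizeof(cufftComplex) * NX * BATCH);
    if (cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH) != CUFFT_SUCCESS) {
        fprintf(stderr, "cufftPlan1d failed\n");
        cudaFree(data);
        return 1;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Repeated work: only the batched transform is timed. The first
    // iteration may still be slower than the rest.
    for (int iter = 0; iter < 5; ++iter) {
        cudaEventRecord(start, 0);
        cufftResult r = cufftExecC2C(plan, data, data, CUFFT_FORWARD);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        if (r != CUFFT_SUCCESS) {
            fprintf(stderr, "cufftExecC2C failed\n");
            break;
        }
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("batch %d: %f ms\n", iter, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```

The program needs to be linked against cuFFT (for example with `-lcufft` when building with nvcc).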
