cufftComplex memory allocation time very high

Hello all,

In my program I need to transfer an array of cufftComplex values to the GPU before FFT’ing them there. I use the cufft library, exactly as I found it in the examples and in the SDK. I allocate memory of the appropriate size on the GPU and time how long this allocation takes, like this:

#define NX      1024
#define BATCH   1024

cufftComplex *devPtr;
unsigned int timer = 0;

cutilCheckError( cutCreateTimer( &timer));
cutilCheckError( cutStartTimer( timer));

// GPU memory allocation
cudaMalloc((void**)&devPtr, sizeof(cufftComplex)*NX*BATCH);

cutilCheckError( cutStopTimer( timer));
printf("Processing time: %f (ms)\n", cutGetTimerValue( timer));
cutilCheckError( cutDeleteTimer( timer));

The timer reports about 80 ms! That is way longer than I expected, and it makes my plan to run 1024 parallel FFTs of size 1024 each on the GPU impossible.

The rest of my program (including memory transfers and such) takes only 10 ms altogether.

Am I missing something here or is this a normal value for such an allocation?

Thanks

What happens when you change cufftComplex to float2?

I tried that, just like in the simpleFFT example where they define Complex as a float2. The timing stays the same. It is the first operation I do in my program (I’m creating a DLL, don’t know if this matters), so I thought maybe it had something to do with warm-up. I then did exactly the same allocation immediately after this one and timed it. That second allocation takes well under 1 ms. Very weird.

The first time a CUDA function gets called through the runtime API (here cudaMalloc), some initialization happens. I don’t know exactly what the initialization does, but I didn’t think it took that long!
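To make the effect of one-time initialization on a timing measurement concrete, here is a minimal sketch in plain C++. It does not call CUDA at all: `fake_runtime_call` is a hypothetical stand-in that pays a simulated 50 ms setup cost on its first invocation only, and `time_ms` is an illustrative helper, not part of any SDK. The point is only that timing the first call measures the setup, not the operation itself.

```cpp
#include <chrono>
#include <thread>

// Hypothetical stand-in for a runtime that initializes lazily:
// the first call pays a one-time setup cost, later calls do not.
static bool g_initialized = false;

void fake_runtime_call() {
    if (!g_initialized) {
        // Simulated one-time setup; 50 ms is an arbitrary illustrative value.
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        g_initialized = true;
    }
    // The actual (cheap) work would happen here.
}

// Illustrative helper: wall-clock time of one invocation of f, in milliseconds.
template <typename F>
double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Calling `time_ms(fake_runtime_call)` twice gives a large first value and a small second one, which is the same pattern as the allocation timings in this thread. In a real CUDA program, a commonly used idiom is to issue a cheap untimed call such as `cudaFree(0)` before starting the timer, so that the initialization is already paid for when the allocation is measured; whether that accounts for all of the delay here is an assumption worth checking.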

Indeed, I never had anything like this. It shouldn’t take this long, yet it does. I was hoping I had done something obviously wrong that you CUDA gurus could solve :-)

I added the whole code below, still can’t figure out what’s wrong.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cutil.h>
#include <string.h>
#include <math.h>
#include <cutil_inline.h>
#include <cufft.h>
#include <cuda.h>

#define BLOCK_W  16
#define BLOCK_H  32

#define iDivUp(a,b) (((int)(a) % (int)(b) != 0) ? (((int)(a) /(int) (b)) + 1) : ((int)(a) /(int) (b)))

#if __DEVICE_EMULATION__

bool InitCUDA(void){return true;}

#else

bool InitCUDA(void)
{
	int count = 0;
	int i = 0;

	cudaGetDeviceCount(&count);
	if(count == 0) {
		fprintf(stderr, "There is no device.\n");
		return false;
	}

	for(i = 0; i < count; i++) {
		cudaDeviceProp prop;
		if(cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
			if(prop.major >= 1) {
				break;
			}
		}
	}

	if(i == count) {
		fprintf(stderr, "There is no device supporting CUDA.\n");
		return false;
	}

	cudaSetDevice(i);
	printf("CUDA initialized.\n");
	return true;
}

#endif

/************************************************************************/
/* Example                                                              */
/************************************************************************/

int main(int argc, char** argv)
{
#define NX      1024
#define BATCH   1024

	unsigned int size = sizeof(cufftComplex)*NX*BATCH;
	printf("Memory size total cufftComplex array: %f bytes\n", (float)size);

	int i = 6;

	unsigned int timer0 = 0;
	cutilCheckError( cutCreateTimer( &timer0));
	cutilCheckError( cutStartTimer( timer0));
	float *sam;
	cudaMalloc((void**)&sam,sizeof(int));
	cutilCheckError( cutStopTimer( timer0));
	printf("Processing time GPU float allocation attempt 0: %f (ms)\n", cutGetTimerValue( timer0));
	cutilCheckError( cutDeleteTimer( timer0));

	cufftComplex *devPtr1;
	unsigned int timer1 = 0;
	cutilCheckError( cutCreateTimer( &timer1));
	cutilCheckError( cutStartTimer( timer1));
	// GPU memory allocation
	cudaMalloc((void**)&devPtr1,size);
	cutilCheckError( cutStopTimer( timer1));
	printf("Processing time GPU cufftComplex allocation attempt 1: %f (ms)\n", cutGetTimerValue( timer1));
	cutilCheckError( cutDeleteTimer( timer1));

	cufftComplex *devPtr2;
	unsigned int timer2 = 0;
	cutilCheckError( cutCreateTimer( &timer2));
	cutilCheckError( cutStartTimer( timer2));
	// GPU memory allocation
	cudaMalloc((void**)&devPtr2,size);
	cutilCheckError( cutStopTimer( timer2));
	printf("Processing time GPU cufftComplex allocation attempt 2: %f (ms)\n", cutGetTimerValue( timer2));
	cutilCheckError( cutDeleteTimer( timer2));

	cufftComplex *devPtr3;
	unsigned int timer3 = 0;
	cutilCheckError( cutCreateTimer( &timer3));
	cutilCheckError( cutStartTimer( timer3));
	// GPU memory allocation
	cudaMalloc((void**)&devPtr3,size);
	cutilCheckError( cutStopTimer( timer3));
	printf("Processing time GPU cufftComplex allocation attempt 3: %f (ms)\n", cutGetTimerValue( timer3));
	cutilCheckError( cutDeleteTimer( timer3));
}

with results:

Memory size total cufftComplex array: 8388608.000000 bytes

Processing time GPU float allocation attempt 0: 46.858471 (ms)

Processing time GPU cufftComplex allocation attempt 1: 0.125714 (ms)

Processing time GPU cufftComplex allocation attempt 2: 0.107276 (ms)

Processing time GPU cufftComplex allocation attempt 3: 0.105879 (ms)

Press any key to continue . . .

So it is simply the first memory allocation that is slow (this time a single float), not the cufftComplex allocation specifically… :/

As was previously mentioned, and as you have found, the first call has overhead.

Why do you have an InitCUDA function that is never called?

Yes indeed, I do use it in my program. I must have deleted the call to see what happened and then copy-pasted the code without it.

In my code, InitCUDA() gets called at the beginning of main.

Do you mean the first call to be the first memory allocation, or the first CUDA function of any kind (for example my InitCUDA)? Because if it is the latter, adding this InitCUDA function doesn’t change anything about the time the first cudaMalloc takes…

As I have seen in my own programs, the first cudaMemcpy takes a lot of time. I don’t think those 20 ms are that unusual. The numbers I have seen were quite similar, so if your code runs fine and you get the right results and consistent timings every time you run it, I suppose there is no bug behind the extra time for your first call.