Help regarding slow cudaMalloc

Hi everyone,

Let me start with my snippet of code:

//~ Timer 1
StartTimer(&timer1);
numLoops = iDivUp(numLoops, 16) * 16;
int numPts = data->numPts;
int numPtsUp = iDivUp(numPts, 16) * 16;
float *d_coord, *d_homo;
int *d_randPts, *h_randPts;
int randSize = 7 * sizeof(int) * numLoops;
cudaMalloc((void **)&d_randPts, randSize);                                   // <== Statement I
CUDA_SAFE_CALL(cudaMalloc((void **)&d_homo, 9 * sizeof(float) * numLoops));  // <== Statement II
h_randPts = (int *)malloc(randSize);
int *validPts = (int *)malloc(sizeof(int) * numPts);
int numValid = 0;
// validPts is a list of the indices of all valid pts
numValid = numPts;
gpuTime1 = StopTimer(timer1);
//~ -----------------
printf("\n%f ", gpuTime1);

Now, when I run my program without Statements I and II, I get gpuTime1 == 0.013000.
When I run it with Statement I only: 44.78
When I run it with Statement II only: 43.34
When I run it with Statements I and II: 44.98

My question is: why don't the times add up, and is this normal behaviour (about 40 ms for a cudaMalloc)? Am I doing something wrong?

Thanks in advance,
Arjun

Just try to put:

cudaFree(0);

at the very beginning of your CUDA code, right after the initialization of your device.
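Something like this, reusing the variables and timer helpers from your snippet above (just a sketch of where the call goes, not a complete program):

cudaSetDevice(0);   // or however you initialize your device
cudaFree(0);        // dummy call: the runtime creates its context here instead

//~ Timer 1
StartTimer(&timer1);
cudaMalloc((void **)&d_randPts, randSize);   // Statement I: now only the allocation itself is timed
gpuTime1 = StopTimer(timer1);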

  • Khanh

The very first substantial CUDA call incurs the cost of initializing the runtime. The suggestion to insert a dummy “cudaFree(0);” is interesting. This won’t crash anything, now or in the future?

It won’t. It especially won’t now that we’ve suggested it publicly.

You could always get around this by explicitly creating a context with the driver API, but that’s a lot closer to unsupported than cudaFree(0).
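For the record, the driver-API route would look roughly like this (a sketch only; error checking omitted, and how well the runtime plays along with a hand-made context depends on the CUDA version, which is why it's "closer to unsupported"):

#include <cuda.h>

int main(void)
{
    CUdevice dev;
    CUcontext ctx;

    cuInit(0);                  // initialize the driver API
    cuDeviceGet(&dev, 0);       // pick device 0
    cuCtxCreate(&ctx, 0, dev);  // create the context explicitly, up front

    // ... runtime-API calls made on this thread would then find an existing
    // context, rather than creating one on the first cudaMalloc ...

    cuCtxDestroy(ctx);          // tear it down when finished
    return 0;
}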

Doesn’t cudaSetDevice(0) also create a context, if I understand correctly from the matlab troubles I had before?

Nope, it doesn’t. cudaSetDevice was changed in 2.1 to return an error when you call it after a context has already been created. In your case, I think the context never went away at all (are mex files executed on the same thread as the main matlab computation thread?), so successive calls broke things; calling cudaThreadExit destroyed the context, and life was good again on subsequent calls.

Yeah, mex files seem to live in the same thread as matlab. And I was indeed not thinking straight, I should put a block on forum posting after 23:00 ;)

Ah, works like a charm… But what really happened? Why does the allocation become so quick? Thanks!

Your first substantial runtime call really amounts to this, in driver API terms:

cuInit(0);
cuCtxCreate(&ctx, 0, dev);
cudaMalloc(…);

So, by doing cudaFree(0) early on, you’re initializing the context before you start timing things.
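If you want to see the effect directly, here is a minimal, self-contained sketch (the timedMalloc helper, the allocation sizes, and the use of std::chrono are my own choices for illustration, not anything from your code):

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: wall-clock time of a single cudaMalloc, in milliseconds.
static double timedMalloc(void **ptr, size_t bytes)
{
    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMalloc(ptr, bytes);
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main(void)
{
    // Comment this line out and the first cudaMalloc below absorbs the
    // one-time runtime/context initialization instead.
    cudaFree(0);

    void *a = 0, *b = 0;
    printf("first  cudaMalloc: %f ms\n", timedMalloc(&a, 1 << 20));
    printf("second cudaMalloc: %f ms\n", timedMalloc(&b, 1 << 20));

    cudaFree(a);
    cudaFree(b);
    return 0;
}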