Help regarding slow cudaMalloc

Hi everyone,

Let me start with my snippet of code:

//~ Timer 1
StartTimer(&timer1);
numLoops = iDivUp(numLoops, 16) * 16;
int numPts = data->numPts;
int numPtsUp = iDivUp(numPts, 16) * 16;
float *d_coord, *d_homo;
int *d_randPts, *h_randPts;
int randSize = 7 * sizeof(int) * numLoops;
cudaMalloc((void **)&d_randPts, randSize);                                   // <== Statement I
CUDA_SAFE_CALL(cudaMalloc((void **)&d_homo, 9 * sizeof(float) * numLoops));  // <== Statement II
h_randPts = (int *)malloc(randSize);
int *validPts = (int *)malloc(sizeof(int) * numPts);
int numValid = 0;
// validPts is a list of the indices of all valid pts
numValid = numPts;
gpuTime1 = StopTimer(timer1);
//~ -----------------
printf("\n%f ", gpuTime1);

Now, when I run my program without Statements I and II, I get gpuTime1 == 0.013000.
When I run it with Statement I only: 44.78
When I run it with Statement II only: 43.34
When I run it with Statements I and II: 44.98

My question is: why don't the times add up, and is this normal behaviour (about 40 ms for a cudaMalloc)? Am I doing something wrong?

Thanks in advance,
Arjun

Just try to put:

cudaFree(0);

at the very beginning of your CUDA code, right after the initialization of your device.
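Something like this, reusing the variables and timer helpers from your snippet above (just a sketch of where the call goes, not a complete program):

cudaSetDevice(0);   // or however you initialize your device
cudaFree(0);        // dummy call: the runtime creates its context here instead

//~ Timer 1
StartTimer(&timer1);
cudaMalloc((void **)&d_randPts, randSize);   // Statement I: now only the allocation itself is timed
gpuTime1 = StopTimer(timer1);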

  • Khanh

The very first substantial CUDA call incurs the cost of initializing the runtime. The suggestion to insert a dummy “cudaFree(0);” is interesting. This won’t crash anything, now or in the future?

It won’t. It especially won’t now that we’ve suggested it publicly.

You could always get around this by explicitly creating a context with the driver API, but that’s a lot closer to unsupported than cudaFree(0).
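For the record, the driver-API route would look roughly like this (a sketch only; error checking omitted, and how well the runtime plays along with a hand-made context depends on the CUDA version, which is why it's "closer to unsupported"):

#include <cuda.h>

int main(void)
{
    CUdevice dev;
    CUcontext ctx;

    cuInit(0);                  // initialize the driver API
    cuDeviceGet(&dev, 0);       // pick device 0
    cuCtxCreate(&ctx, 0, dev);  // create the context explicitly, up front

    // ... runtime-API calls made on this thread would then find an existing
    // context, rather than creating one on the first cudaMalloc ...

    cuCtxDestroy(ctx);          // tear it down when finished
    return 0;
}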

Doesn’t cudaSetDevice(0) also create a context, if I understand correctly from the matlab troubles I had before?

Nope, it doesn’t. cudaSetDevice was changed in 2.1 to return an error when you call it after a context has already been created. In your case, I think the context never went away at all (are mex files executed on the same thread as the main matlab computation thread?), so successive calls broke things; calling cudaThreadExit destroyed the context, and life was good again on subsequent calls.

Yeah, mex files seem to live in the same thread as matlab. And I was indeed not thinking straight, I should put a block on forum posting after 23:00 ;)

Ah, works like a charm… But what really happened? Why does the allocation become so quick? Thanks!

Your first substantial runtime call really amounts to this, in driver API terms:

cuInit(0);
cuCtxCreate(&ctx, 0, dev);
cudaMalloc(…);

So, by doing cudaFree(0) early on, you’re initializing the context before you start timing things.
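If you want to see the effect directly, here is a minimal, self-contained sketch (the timedMalloc helper, the allocation sizes, and the use of std::chrono are my own choices for illustration, not anything from your code):

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: wall-clock time of a single cudaMalloc, in milliseconds.
static double timedMalloc(void **ptr, size_t bytes)
{
    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMalloc(ptr, bytes);
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main(void)
{
    // Comment this line out and the first cudaMalloc below absorbs the
    // one-time runtime/context initialization instead.
    cudaFree(0);

    void *a = 0, *b = 0;
    printf("first  cudaMalloc: %f ms\n", timedMalloc(&a, 1 << 20));
    printf("second cudaMalloc: %f ms\n", timedMalloc(&b, 1 << 20));

    cudaFree(a);
    cudaFree(b);
    return 0;
}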