cudaMalloc problems

I found a strange problem with cudaMalloc:
I ran matrixMul from SDK 1.1 on a GeForce 8800 GTX. As the program shows, it allocates three arrays in GPU memory using cudaMalloc. However, the elapsed times are very different:
Malloc the first array: 69.432 ms
Malloc the second array: 0.099 ms
Malloc the third array: 0.099 ms

Why?
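
For reference, here is a minimal sketch of the kind of host-side timing that reproduces this (not the SDK source; the buffer size and helper name are placeholders, and it assumes a C++11 host compiler):

```cpp
// Sketch only -- not the SDK source. Times each cudaMalloc with a host
// timer; "bytes" is a placeholder, matrixMul computes its real sizes
// from the matrix dimensions in matrixMul.h.
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

static float timedMalloc(void** ptr, size_t bytes)
{
    auto t0 = std::chrono::steady_clock::now();
    cudaMalloc(ptr, bytes);                     // blocking host-side call
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<float, std::milli>(t1 - t0).count();
}

int main()
{
    void *dA, *dB, *dC;
    size_t bytes = 1 << 20;                     // placeholder: 1 MB each

    printf("first  cudaMalloc: %8.3f ms\n", timedMalloc(&dA, bytes));
    printf("second cudaMalloc: %8.3f ms\n", timedMalloc(&dB, bytes));
    printf("third  cudaMalloc: %8.3f ms\n", timedMalloc(&dC, bytes));

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```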

Try calling this set of mallocs twice in the code (even though it's meaningless) just for benchmarking purposes, as in the sketch below. If the first malloc is at the beginning of the application, some core initialization may be taking place during that call; this would be confirmed if the second call with the same malloc configuration takes much less time.
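
Reusing the hypothetical timedMalloc helper and bytes value from the sketch above, the double-run experiment could look like this; if only the first allocation of pass 0 is slow, the 69 ms is one-time setup rather than anything specific to that buffer:

```cpp
// Sketch: run the same three allocations twice in one process.
for (int pass = 0; pass < 2; ++pass) {
    void *dA, *dB, *dC;
    printf("pass %d  first : %8.3f ms\n", pass, timedMalloc(&dA, bytes));
    printf("pass %d  second: %8.3f ms\n", pass, timedMalloc(&dB, bytes));
    printf("pass %d  third : %8.3f ms\n", pass, timedMalloc(&dC, bytes));
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```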

At first glance, I think the cost gap should be the overhead of initialization. However, the code calls CUT_DEVICE_INIT() before cudaMalloc.

It seems that the initialization should already be finished inside CUT_DEVICE_INIT(), right?

But apparently that is not the case; is the initialization actually done in cudaMalloc?
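
One way to test that, assuming the runtime sets up its context lazily on the first call that needs it: insert a throwaway runtime call before the timed allocations and see whether the first cudaMalloc drops into the same sub-millisecond range as the other two.

```cpp
// Sketch: warm up the runtime before any timed cudaMalloc.
// cudaFree(0) is a harmless call that still forces context creation.
cudaFree(0);

// ... then time the three cudaMalloc calls as before; if the 69 ms was
// pure initialization, all three should now be in the 0.1 ms range.
```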

You can see what CUT_DEVICE_INIT() does – it is in cutil.h. It doesn't download the driver to the card or anything like that; that is done with the first call to a kernel function. I'm not sure about cudaMalloc, though.
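
From memory (check your copy of cutil.h for the exact code), the macro amounts to little more than picking a device, which by itself does not appear to trigger the heavyweight setup:

```cpp
// Rough paraphrase of CUT_DEVICE_INIT, not the actual cutil.h source.
int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);
if (deviceCount == 0) {
    fprintf(stderr, "There is no device.\n");
    exit(EXIT_FAILURE);
}

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
printf("Using device 0: %s\n", prop.name);
cudaSetDevice(0);   // selects the device; the expensive per-process
                    // initialization still happens later, lazily
```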