I found a strange problem with cudaMalloc:
I ran matrixMul from SDK 1.1 on a GeForce 8800 GTX. As the program shows, it allocates three arrays in GPU memory using cudaMalloc. However, the elapsed times are very different:
Malloc the first array: 69.432 ms
Malloc the second array: 0.099 ms
Malloc the third array: 0.099 ms
Try calling this set of mallocs twice in the code (even though it’s meaningless), just for benchmarking purposes. If the first malloc is at the beginning of the application, some core initialization may be taking place during that call. This will be confirmed if the second run of the same malloc configuration takes less time.
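Something like the following could be used for that test. It is only a rough sketch, not the matrixMul code: the helper name timedMallocMs and the 1 MiB array size are made up for illustration, and it times with std::chrono on the host, which assumes a toolkit with C++11 support (newer than SDK 1.1). It allocates three arrays, frees them, and repeats once, so the two rounds can be compared.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: wall-clock time of one cudaMalloc call, in milliseconds.
static double timedMallocMs(void **ptr, size_t bytes)
{
    auto start = std::chrono::steady_clock::now();
    cudaError_t err = cudaMalloc(ptr, bytes);
    auto stop = std::chrono::steady_clock::now();
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        *ptr = NULL;  // keep the later cudaFree a harmless no-op
    }
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main()
{
    const size_t bytes = 1 << 20;  // 1 MiB per array (arbitrary, an assumption)

    // Run the same three allocations twice and print each timing.
    for (int round = 0; round < 2; ++round) {
        void *buf[3] = {NULL, NULL, NULL};
        for (int i = 0; i < 3; ++i)
            printf("round %d, array %d: %.3f ms\n",
                   round, i, timedMallocMs(&buf[i], bytes));
        for (int i = 0; i < 3; ++i)
            cudaFree(buf[i]);
    }
    return 0;
}
```

If only the very first allocation of the first round is slow and everything after it is fast, that would point to one-time initialization rather than to cudaMalloc itself.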
You can see what CUT_DEVICE_INIT() does – it is in cutil.h. It doesn’t download the driver to the card or anything like that. This is done with the first call to a kernel function. I’m not sure about cudaMalloc though.