cudaMalloc() takes different times on (nearly) identical GPUs

I have 2 systems, one with a Tesla C1060 and the other with a Tesla M1060. I am running identical programs on both. When I do a cudaMalloc() (about 32 MB), the C1060 takes about 51 ms, but the M1060 takes 1.4 seconds.

Anyone know what I should do to bring the time down on the m1060?

Thanks in advance for any help.

Bala

Run nvidia-smi -l in the background while you are running your cudaMalloc, and you will probably see the disparity disappear.

Thank you. I will try this.

Is there an equivalent cuda API call I can make from the application itself?

Thank you,

Bala

Presumably you're not running X on the M1060 box but are on the C1060 box. There's a period of time needed to initialize internal driver state, which gets torn down when there are no clients of the Linux driver; that's why the warmup time is there. There will be a utility to manage this in a future driver release.
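In lieu of such a utility, a common in-application workaround (answering the earlier question about an equivalent API call) is to force context creation once at startup so later allocations don't pay the driver-initialization cost. The cudaFree(0) idiom below is a well-known warmup trick; the 32 MB size just mirrors the original post. A minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Force CUDA context creation up front so the first real
    // allocation doesn't pay the driver-initialization cost.
    cudaSetDevice(0);
    cudaFree(0);  // no-op free that triggers lazy context init

    // Time the allocation itself with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    void *buf = NULL;
    cudaEventRecord(start, 0);
    cudaMalloc(&buf, 32u << 20);  // ~32 MB, as in the original post
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudaMalloc of 32 MB took %.3f ms\n", ms);

    cudaFree(buf);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

On later drivers, persistence mode (nvidia-smi -pm 1, or the nvidia-persistenced daemon) keeps driver state resident even when no client is attached, which addresses the same problem without a background process.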

Thank you. That worked in bringing the cudaMalloc time down.

But there is still a difference in the kernel execution time. The program runs a very tight loop, just calling the kernel for an iteration count that can go up to 1000. On the M1060, the kernel is about 4% slower.
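One way to check whether the remaining gap is in the kernel itself rather than in one-time setup is to time the whole loop with CUDA events and look at the per-iteration average. This sketch uses a placeholder kernel (myKernel) and launch configuration, since the original kernel isn't shown:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real one.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20, iters = 1000;
    float *d = NULL;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the whole tight loop, then divide by the iteration count:
    // a consistent per-iteration delta points at the kernel itself,
    // not at one-time context or allocation cost.
    cudaEventRecord(start, 0);
    for (int it = 0; it < iters; ++it)
        myKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg kernel time: %.4f ms over %d iterations\n", ms / iters, iters);

    cudaFree(d);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

If the per-iteration difference is steady across the run, it would also be worth comparing the clocks each board reports (e.g. via deviceQuery or nvidia-smi) to confirm the two parts really are running at identical frequencies.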

Anything else I should run to bring everything on par between these two Tesla platforms?

Thank you,

Bala