I have two systems, one with a Tesla C1060 and the other with a Tesla M1060. I am running identical programs on both. When I call cudaMalloc() (for about 32 MB), the C1060 takes about 51 ms, but the M1060 takes about 1.4 seconds.
Does anyone know what I should do to bring the time down on the M1060?
Presumably you're not running X on the M1060 box but are on the C1060 box. There's a period of time needed to initialize internal driver state, and that state gets torn down whenever there are no clients of the Linux driver, which is why the warmup cost shows up on the M1060. There will be a utility to manage this in a future driver release.
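To see how much of the cost is one-time driver/context initialization versus the allocation itself, you can force context creation before timing the allocation. This is a minimal sketch, not code from the thread; `cudaFree(0)` is a common idiom for forcing CUDA context creation, and the 32 MB size matches the original post.

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    // Force context creation up front so driver/context initialization
    // is not attributed to the first cudaMalloc below.
    cudaFree(0);

    void *buf = nullptr;
    auto t0 = std::chrono::steady_clock::now();
    cudaMalloc(&buf, 32u << 20);   // 32 MB, as in the original post
    auto t1 = std::chrono::steady_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("cudaMalloc took %.3f ms\n", ms);

    cudaFree(buf);
    return 0;
}
```

If the 1.4 s disappears once the `cudaFree(0)` line is present (or once another driver client keeps the state alive), the slowdown was the initialization described above, not the allocation.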
Thank you. That worked in bringing the cudaMalloc time down.
But there is still a difference in kernel execution time. The program runs in a very tight loop, calling the kernel for an iteration count that can go up to 1000, and the kernel is about 4% slower on the M1060.
Anything else I should run to bring everything on par between these two Tesla platforms?
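For a like-for-like comparison between the two boards, it helps to time the kernel loop with CUDA events after a warm-up launch, so one-time setup cost is excluded on both machines. A minimal sketch; the `step` kernel here is a hypothetical placeholder for whatever kernel the loop actually calls:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel in the original post.
__global__ void step(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 1.0001f + 0.5f;
}

int main() {
    const int n = 1 << 20;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    // Warm-up launch: excludes module load / first-launch overhead.
    step<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0);
    for (int it = 0; it < 1000; ++it)   // tight loop, as described
        step<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("avg per-iteration: %.3f ms\n", ms / 1000.0f);

    cudaFree(d);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    return 0;
}
```

If a consistent 4% gap remains even with identical driver versions and clocks, it may come down to hardware differences (cooling and sustained clocks differ between the actively cooled C1060 and the passively cooled M1060 module), which no software setting will remove.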