I have two systems, one with a Tesla C1060 and the other with a Tesla M1060. I am running identical programs on both. When I call cudaMalloc() for about 32 MB, the C1060 takes about 51 ms, but the M1060 takes 1.4 seconds.
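For reference, here is a minimal sketch of how such a measurement could be set up. It is not the actual program. It assumes the CUDA runtime API and the default device, and the helper names `time_ms` and `mb_to_bytes` are made up for illustration. The CUDA runtime creates its context lazily on the first call that needs it, so the sketch times a `cudaFree(0)` call first (a common way to force that initialization) and then times the 32 MB cudaMalloc() separately.

```cuda
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: run a callable and return elapsed wall-clock time in ms.
template <typename F>
double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Hypothetical helper: convert mebibytes to bytes.
constexpr std::size_t mb_to_bytes(std::size_t mb) { return mb * 1024 * 1024; }

int main() {
    void* buf = nullptr;
    cudaError_t init_err = cudaSuccess;
    cudaError_t alloc_err = cudaSuccess;

    // First runtime call: forces context creation, so one-time startup cost lands here.
    double init_ms = time_ms([&] { init_err = cudaFree(0); });

    // Second call: the allocation itself, with the context already created.
    double alloc_ms = time_ms([&] { alloc_err = cudaMalloc(&buf, mb_to_bytes(32)); });

    if (init_err != cudaSuccess || alloc_err != cudaSuccess) {
        cudaError_t err = (init_err != cudaSuccess) ? init_err : alloc_err;
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("context init: %.2f ms\n", init_ms);
    std::printf("cudaMalloc (32 MB): %.2f ms\n", alloc_ms);

    cudaFree(buf);
    return 0;
}
```

This should build with `nvcc`. If most of the time shows up in the first number, the delay is in context creation rather than in the allocation itself.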
Does anyone know what I should do to bring the time down on the M1060?
Thanks in advance for any help.