Hi, I'm running the same code on two different cards: a 9400 GT and a GTX 280. My kernel stores its results in global memory. The problem is that when I copy the results back to the host, the 9400 GT shows a CPU time of around 2 ms for 1.6 MB, which seems normal to me, but the GTX 280 takes 0.9 s for the same amount of data.

I tried using page-locked memory. On the 9400 GT the code runs fine and the copy time dropped to 1.3 ms, but on the GTX 280 the same code produced a segmentation fault and the card was somehow “locked”: I could not use it afterwards, as it showed up as busy or not available.

I have also tried a simple standalone device-to-host copy of 1.6 MB and more, and the time was normal. Has anyone experienced such a problem? Note: I measure with NVIDIA's profiler and use CUDA 4.0. The OS is Ubuntu on both systems, but different versions.
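One thing worth ruling out with this kind of measurement: kernel launches are asynchronous, and a blocking `cudaMemcpy` first waits for the preceding kernel to finish, so a copy timed right after a launch can include kernel execution time. Below is a minimal sketch of timing the device-to-host copy on its own, with either pageable or page-locked host memory. It is not the code from the post: the kernel `fill_results`, the helper `time_d2h_copy_ms`, and the launch configuration are hypothetical placeholders.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: writes n results to global memory.
__global__ void fill_results(float *out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

// Times only the device-to-host copy of n floats, in milliseconds.
// Returns -1.0f if an allocation or any checked CUDA call fails.
float time_d2h_copy_ms(size_t n, bool pinned) {
    const size_t bytes = n * sizeof(float);
    float *d_out = NULL;
    float *h_out = NULL;
    float ms = -1.0f;
    cudaEvent_t start, stop;

    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) return -1.0f;

    // Host buffer: page-locked via cudaMallocHost, or ordinary pageable memory.
    cudaError_t err = cudaSuccess;
    if (pinned) err = cudaMallocHost((void **)&h_out, bytes);
    else        h_out = (float *)malloc(bytes);
    if (err != cudaSuccess || h_out == NULL) {
        cudaFree(d_out);
        return -1.0f;
    }

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;  // assumed block size
    const int blocks = (int)((n + threads - 1) / threads);
    fill_results<<<blocks, threads>>>(d_out, n);

    // Check the launch, then wait for the kernel so its run time
    // is not counted as part of the copy.
    err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    if (err == cudaSuccess) {
        cudaEventRecord(start, 0);
        err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        if (err == cudaSuccess) cudaEventElapsedTime(&ms, start, stop);
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        ms = -1.0f;
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    if (pinned) cudaFreeHost(h_out);  // pinned memory must not be passed to free()
    else        free(h_out);
    cudaFree(d_out);
    return ms;
}
```

Calling `time_d2h_copy_ms(400000, false)` and `time_d2h_copy_ms(400000, true)` covers roughly the 1.6 MB case with pageable and page-locked buffers. If the copy is fast once the kernel has been synchronized, the 0.9 s is probably being spent in the kernel rather than in the transfer. Checking the return codes may also show whether an earlier error precedes the segmentation fault.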