You are probably measuring not the cudaMemcpy() performance but your kernel’s performance. You must call cudaThreadSynchronize() before starting your timer, or else the synchronization between the CPU and GPU happens during the memcpy and accounts for most of the measured time.
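A rough sketch of what that looks like (a minimal, hypothetical timing snippet; on newer toolkits cudaDeviceSynchronize() replaces cudaThreadSynchronize()):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *d, int n) {        // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));

    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);      // asynchronous launch

    // Wait for the kernel to finish BEFORE starting the timer; otherwise
    // its runtime is charged to the memcpy measured below.
    cudaThreadSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("memcpy took %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    free(h);
    return 0;
}
```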
The reason your kernel is ending early is probably that it is accessing an illegal address. Look at an SDK sample and note the calls made to check for errors immediately after a kernel launch.
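For illustration, the post-launch check usually looks something like this (myKernel and its arguments are placeholders, not from your code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;   // bounds check guards against illegal addresses
}

int main() {
    const int n = 1024;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    // Catch launch-configuration errors immediately after the launch...
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

    // ...and execution errors (e.g. an illegal address) once the kernel
    // has actually run, which requires a synchronize.
    err = cudaThreadSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return 0;
}
```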
I tried using CUT_DEVICE_INIT(argc, argv); in my main function, and then wrapped every cudaMalloc(), cudaMemcpy(), and cudaFree() call in CUDA_SAFE_CALL. There is no error message, but the program still doesn’t work.
The problem is strange. My program processes one image at a time. If I use an image of size X as input, it works. Then if I use another image of size Y, the program fails. Once the program has failed, it won’t work even if I process the image of size X again. I thought maybe somewhere I didn’t clean up the memory, but I checked my program several times and everything is cleaned up. So I don’t know what could cause the problem.
What I was thinking of is the macro CUT_CHECK_ERROR(). Use it after every kernel call (but keep in mind that the real CUT_CHECK_ERROR() only works in debug builds).
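Since CUT_CHECK_ERROR() compiles to a no-op outside debug builds, one common workaround is a hand-rolled macro that stays active in release builds too. A sketch (CHECK_KERNEL is a made-up name, not part of the SDK):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Synchronizes so that execution errors from the preceding kernel are
// reported, then aborts with a message if anything went wrong.
#define CHECK_KERNEL(msg)                                                \
    do {                                                                 \
        cudaThreadSynchronize();                                         \
        cudaError_t e = cudaGetLastError();                              \
        if (e != cudaSuccess) {                                          \
            fprintf(stderr, "%s: %s\n", (msg), cudaGetErrorString(e));   \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

// usage:
//   myKernel<<<grid, block>>>(d_data, n);
//   CHECK_KERNEL("myKernel");
```

Note that the cudaThreadSynchronize() inside the macro costs performance, so you may want to compile it out once the code is debugged.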
I think your kernel is accessing memory out of bounds and causing an exception. Before it crashes, it seems to overwrite critical CUDA-related portions of GPU memory. Or something like that. Anyway, it’s not unusual for a crashed kernel to require a reboot to clean up, at least on some platforms. Out of curiosity, which OS are you using?