DEBUG vs RELEASE data transfer times: unexpected results

Hi, I get some rather unexpected results when transferring a large data array from the GPU to the PC in RELEASE and DEBUG modes. In both cases I used exactly the same code and parameters.

For example: array size ~4 MB
Transfer time in DEBUG: ~4.5 ms
Transfer time in RELEASE: ~128 ms

Time measurements are taken with CUDA timers placed directly around the memory copy call, with no other code in between. I checked it a couple of times, for different grid sizes and so on, and I always get similar results.
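In case it helps, the timing is done roughly like this (a simplified sketch using CUDA event timers; the real buffers and sizes are of course set up differently in my code):

    // Sketch: time a single device-to-host copy with CUDA events.
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const size_t bytes = 4 * 1024 * 1024;          // ~4 MB, as in the example above
        char *d_data = NULL;
        char *h_data = (char *)malloc(bytes);
        cudaMalloc((void **)&d_data, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // the timed call
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("device-to-host copy: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_data);
        free(h_data);
        return 0;
    }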

HW is GeForce 8800 GTX.

Do you have any idea what could have gone wrong here?

Also, it looks like there is always some latency of ~0.02-0.03 ms when issuing a memory copy call, and the transfer times do not increase linearly with the size of the transferred data, at least not for relatively small sizes. Is there more detailed information on this topic somewhere?

DEBUG builds are always slower because they omit optimizations and include additional checks, such as stack checks, which can slow down execution considerably. So I guess this is a host-code problem.

I suggest you enable the CUDA profiler and check the memcpy timings for the debug and release builds; they should be very close.

As for memcpy performance: yes, there is some overhead in issuing a copy operation, about 15-25 us on my machine. I haven't seen it described in the documentation, so you may try searching this forum.
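If you want to see that fixed per-call overhead yourself, a simple sweep over transfer sizes makes it visible; something along these lines should do (just a sketch, error checking omitted):

    // Sketch: sweep device-to-host copy sizes to separate the fixed
    // per-call overhead from the bandwidth-limited part.
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const size_t max_bytes = 8 * 1024 * 1024;
        char *d_buf = NULL;
        char *h_buf = (char *)malloc(max_bytes);
        cudaMalloc((void **)&d_buf, max_bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        for (size_t bytes = 4 * 1024; bytes <= max_bytes; bytes *= 2) {
            cudaEventRecord(start, 0);
            cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
            cudaEventRecord(stop, 0);
            cudaEventSynchronize(stop);

            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            printf("%8lu bytes: %.3f ms  (%.1f MB/s)\n",
                   (unsigned long)bytes, ms,
                   (bytes / (1024.0 * 1024.0)) / (ms / 1000.0));
        }

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_buf);
        free(h_buf);
        return 0;
    }

With the profiler enabled (if I remember correctly, setting the CUDA_PROFILE=1 environment variable with this generation of the toolkit writes a cuda_profile.log next to the executable), you should see matching memcpy timings there as well.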

Sorry, there was a mistake in my post.

The strange thing about the data transfer times I'm getting is that the RELEASE values are much worse than the DEBUG values.

Do you run DEBUG in emulation mode? There is quite a big overhead in launching a GPU kernel, and your times indicate a small workload, so maybe the overhead here is much larger than the gain?

Anyway, try the CUDA profiler. It will tell you whether the problem is in the driver or in your host code (I'm almost sure it's somewhere in your code or your nvcc options).

Do you have two CUDA-compatible cards in the system?