Small memcpy: memcpy takes too long with small data

Is there any way to speed up small memcpys? In my kernel calls, copying the results from device memory to main memory takes the most time. I am only copying 24 bytes, but the call takes 600 us.

I am using page-locked memory, but it makes no difference, which leads me to conclude that the fixed overhead of the memcpy call is the dominant factor.

I’ve observed about 20 microseconds of latency or overhead in cudaMemcpy on Linux. It doesn’t really matter whether you use pageable or page-locked memory for this case. What system are you running on where you observe 600 microseconds for a small transfer?

Actually, I'm using VS2005 under XP on a P4. I think the 600 us figure may be an artifact of my instrumentation. I was inserting fprintfs right into the code, and that probably affected the cache performance of the processor.

When I log the timestamp in memory directly, I get a figure closer to 20 us. Thanks!
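
For anyone measuring the same thing, here is a rough sketch of the in-memory logging idea described above. It is not the code from this thread. The names `Sample`, `time_small_copies`, and `print_log` are made up for illustration, and a plain `std::memcpy` of 24 bytes stands in for the `cudaMemcpy` call so the sketch builds without the CUDA toolkit. The point is only that timestamps go into a preallocated buffer and all printing happens after the measurements are done.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical in-memory log entry: one pair of timestamps per measured call.
struct Sample {
    Clock::time_point start;
    Clock::time_point stop;
};

// Times `iterations` 24-byte copies and keeps the timestamps in memory.
// No I/O happens inside the loop. std::memcpy is only a stand-in for the
// call under test (an assumption; the thread measured cudaMemcpy).
std::vector<Sample> time_small_copies(std::size_t iterations) {
    std::vector<Sample> log(iterations);  // preallocated, so no allocation in the loop
    unsigned char src[24] = {1, 2, 3};
    unsigned char dst[24] = {0};

    for (std::size_t i = 0; i < iterations; ++i) {
        log[i].start = Clock::now();
        std::memcpy(dst, src, sizeof dst);
        log[i].stop = Clock::now();
    }
    return log;
}

// Prints the samples only after every measurement has been taken.
void print_log(const std::vector<Sample>& log) {
    for (std::size_t i = 0; i < log.size(); ++i) {
        const double us =
            std::chrono::duration<double, std::micro>(log[i].stop - log[i].start).count();
        std::printf("call %zu: %.3f us\n", i, us);
    }
}
```

To try it, call `print_log(time_small_copies(100));` from `main`. With a real device-to-host copy in place of `std::memcpy`, the first few samples are usually worth discarding as warm-up.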