Trouble profiling async return values when host memory is pinned

I’m attempting to write a program that imports a small amount of data (128-512 complex pairs), performs several operations (including multiple FFTs) on that block plus the previous 16-256 blocks, and then returns one pair of values (essentially the maximum value and location of the correlation of a reference against the incoming historical data).

In order to run in real time, the cycle time needs to be ~1 msec, so synchronous moves between the host and device won’t cut it. I’ve created two streams and ping-pong incoming blocks (and their associated follow-on computations) between them. After profiling (using CUDA 4.0 on a compute capability 2.1 device for now), I got results that showed a lot of ‘dead’ time (no host computations or moves), even though (after the initialization routines) all kernel launches went through the streams and all copies were asynchronous on the matching stream. I saw in the profiler that the first of the two values returned to the host was taking 320 microseconds on the CPU even though the GPU time was quite small (4 microseconds).
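For reference, here is a minimal sketch of the kind of two-stream ping-pong described above. It is not the actual application code: `processBlock`, `countErrors`, `runPingPong`, and the `CHECK` macro are placeholder names, and the kernel is a trivial stand-in for the real FFT/correlation work. It assumes pinned host buffers on both sides so that `cudaMemcpyAsync` has a chance to overlap.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: bail out of the enclosing function on any CUDA error.
// (Early returns leak resources; acceptable for a sketch only.)
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Placeholder for the real per-block work (FFTs, correlation, peak search).
__global__ void processBlock(const float2* in, float2* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = make_float2(in[i].x * 2.0f, in[i].y * 2.0f);
}

// Counts mismatches between a returned block and the expected doubled input.
static int countErrors(const float2* out, int n, int b)
{
    int errors = 0;
    for (int i = 0; i < n; ++i)
        if (out[i].x != 2.0f * (b + i) || out[i].y != -2.0f * i) ++errors;
    return errors;
}

// Pushes numBlocks blocks of n complex samples through two alternating streams.
// Returns the number of wrong output samples (0 on success), or -1 on a CUDA error.
int runPingPong(int n, int numBlocks)
{
    cudaStream_t stream[2];
    float2 *hIn[2], *hOut[2], *dIn[2], *dOut[2];
    const size_t bytes = n * sizeof(float2);
    int errors = 0;

    for (int s = 0; s < 2; ++s) {
        CHECK(cudaStreamCreate(&stream[s]));
        CHECK(cudaHostAlloc((void**)&hIn[s],  bytes, cudaHostAllocDefault));  // pinned
        CHECK(cudaHostAlloc((void**)&hOut[s], bytes, cudaHostAllocDefault));  // pinned
        CHECK(cudaMalloc((void**)&dIn[s],  bytes));
        CHECK(cudaMalloc((void**)&dOut[s], bytes));
    }

    // Two extra iterations drain the last block issued on each stream.
    for (int b = 0; b < numBlocks + 2; ++b) {
        const int s = b & 1;  // alternate streams: 0, 1, 0, 1, ...

        // Before reusing this stream's buffers, wait for the block issued
        // two iterations ago and check its result.
        CHECK(cudaStreamSynchronize(stream[s]));
        if (b >= 2) errors += countErrors(hOut[s], n, b - 2);
        if (b >= numBlocks) continue;

        for (int i = 0; i < n; ++i)
            hIn[s][i] = make_float2((float)(b + i), (float)-i);

        // Upload, compute, and download are all queued on the same stream.
        CHECK(cudaMemcpyAsync(dIn[s], hIn[s], bytes, cudaMemcpyHostToDevice, stream[s]));
        processBlock<<<(n + 127) / 128, 128, 0, stream[s]>>>(dIn[s], dOut[s], n);
        CHECK(cudaMemcpyAsync(hOut[s], dOut[s], bytes, cudaMemcpyDeviceToHost, stream[s]));
    }

    for (int s = 0; s < 2; ++s) {
        cudaFreeHost(hIn[s]);  cudaFreeHost(hOut[s]);
        cudaFree(dIn[s]);      cudaFree(dOut[s]);
        cudaStreamDestroy(stream[s]);
    }
    return errors;
}
```

The host only blocks on a stream when it is about to reuse that stream's buffers, so work queued on the other stream is free to proceed in the meantime.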

I thought it might be a good idea to use pinned, or maybe even mapped, memory to return these values to the host, but when I do this the profiler complains and chops off every result after the first stream’s write back to the host. This happens with either a cudaMemcpyAsync back to pinned (but not mapped) memory or an in-kernel write to the mapped pointer.

I’ve set the device flags needed to allow mapping (I already use it on the host->device transfer), as well as fetching the device-side pointer to the host memory using cudaHostGetDevicePointer. I’ve used zero-copy for years going out to the device without any problems (previous applications always output to a display device, or if they brought something back, it was for debugging purposes and was never meant to be fast, i.e. I never used pinned memory for return values).
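To make the mapped-memory setup concrete, here is a minimal sketch of a zero-copy return path, assuming a single result pair written by one thread. `writeResult`, `zeroCopyReturn`, and `CHECK` are placeholder names rather than anything from the real program. The sketch tolerates `cudaErrorSetOnActiveProcess` from `cudaSetDeviceFlags` on the assumption that a context may already exist in the process.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: bail out of the enclosing function on any CUDA error.
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Placeholder kernel: one thread writes the (peak, location) pair straight
// into mapped host memory through the device-side alias of the host pointer.
__global__ void writeResult(float2* result, float peak, float location)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *result = make_float2(peak, location);
        __threadfence_system();  // push the write out toward host-visible memory
    }
}

// Returns 0 on success and stores the pair read back on the host in *outHost.
int zeroCopyReturn(float peak, float location, float2* outHost)
{
    // Request mapped pinned allocations. This is meant to be called before the
    // context is created; if one is already active we clear the error and go on
    // (assumption: mapping may already be enabled in that case).
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceMapHost);
    if (err != cudaSuccess && err != cudaErrorSetOnActiveProcess) return -1;
    cudaGetLastError();  // reset any sticky error state from the call above

    float2* hResult = 0;  // host-side pointer (pinned and mapped)
    float2* dResult = 0;  // device-side alias of the same memory
    CHECK(cudaHostAlloc((void**)&hResult, sizeof(float2), cudaHostAllocMapped));
    CHECK(cudaHostGetDevicePointer((void**)&dResult, hResult, 0));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    writeResult<<<1, 32, 0, stream>>>(dResult, peak, location);

    // The host must not read the value until the kernel has finished.
    CHECK(cudaStreamSynchronize(stream));
    *outHost = *hResult;

    cudaStreamDestroy(stream);
    cudaFreeHost(hResult);
    return 0;
}
```

No explicit device->host copy is issued here; the only synchronization point is the `cudaStreamSynchronize` before the host reads the result.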

If I change the return pointer to a pageable host pointer and use cudaMemcpyAsync, the profiler works fine again. Why can’t I use the profiler with pinned host memory on a device->host transfer? Additionally, why all the dead time when everything I’m doing is asynchronous? I may have to look into creating two entire threads to ping-pong…
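One way to check whether the return copy itself is stalling the host is to time the `cudaMemcpyAsync` call on the CPU for a pageable versus a pinned destination while a kernel is still pending on the stream. As far as I understand the runtime documentation, a device->pageable-host `cudaMemcpyAsync` returns only once the copy has completed, which could show up as host-side dead time. The sketch below is only a measurement harness under that assumption: `slowWrite`, `hostMicrosInCopy`, `compareReturnPaths`, and `CHECK` are placeholder names, and it needs a C++11-capable nvcc for `std::chrono`.

```cuda
#include <cuda_runtime.h>
#include <chrono>

// Hypothetical helper: bail out of the enclosing function on any CUDA error.
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Placeholder kernel that takes a while, so the copy queued behind it has to wait.
__global__ void slowWrite(float2* result, int iters)
{
    float acc = 0.0f;
    for (int i = 0; i < iters; ++i) acc = acc * 0.999f + 1.0f;
    *result = make_float2(acc, (float)iters);
}

// Host microseconds spent inside one device->host cudaMemcpyAsync call.
static double hostMicrosInCopy(void* dst, const void* src, size_t bytes,
                               cudaStream_t stream, cudaError_t* err)
{
    std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    *err = cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToHost, stream);
    std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}

// Measures the host time of the copy call for a pageable and a pinned destination.
// Returns 0 on success, -1 on any CUDA error.
int compareReturnPaths(double* pageableUs, double* pinnedUs)
{
    const size_t bytes = sizeof(float2);
    const int iters = 1 << 20;
    float2* dResult = 0;
    float2  pageable;      // ordinary stack memory: pageable
    float2* pinned = 0;    // page-locked host memory
    cudaStream_t stream;
    cudaError_t err;

    CHECK(cudaMalloc((void**)&dResult, bytes));
    CHECK(cudaHostAlloc((void**)&pinned, bytes, cudaHostAllocDefault));
    CHECK(cudaStreamCreate(&stream));

    // Warm-up launch so one-time initialization is not counted.
    slowWrite<<<1, 1, 0, stream>>>(dResult, 1);
    CHECK(cudaStreamSynchronize(stream));

    // Pageable destination: copy queued behind a pending kernel.
    slowWrite<<<1, 1, 0, stream>>>(dResult, iters);
    *pageableUs = hostMicrosInCopy(&pageable, dResult, bytes, stream, &err);
    CHECK(err);
    CHECK(cudaStreamSynchronize(stream));

    // Pinned destination: same pattern.
    slowWrite<<<1, 1, 0, stream>>>(dResult, iters);
    *pinnedUs = hostMicrosInCopy(pinned, dResult, bytes, stream, &err);
    CHECK(err);
    CHECK(cudaStreamSynchronize(stream));

    cudaStreamDestroy(stream);
    cudaFreeHost(pinned);
    cudaFree(dResult);
    return 0;
}
```

If the pageable number tracks the kernel duration while the pinned number stays small, the call is blocking the host rather than the transfer itself being slow.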

Don’t know why, but this is now working when using a zero-copy back to the host with pinned, mapped memory. I haven’t tried using discrete ‘cudaMemcpyAsync’ commands with pinned-only host memory.