uncached memory created by cudaHostAlloc and cudaMemcpyAsync issues on TX1

It has been confirmed that memory allocated by the cudaHostAlloc API is uncached on TX1, so CPU access to it performs very badly.
[ref] https://devtalk.nvidia.com/default/topic/922626/?comment=4834970

And the hostPtr passed to cudaMemcpyAsync [d->h] must point to page-locked host memory for any overlap to occur.

So the issue is:

If we want to benefit from cudaMemcpyAsync [d->h] behaving asynchronously, the page-locked host memory must be allocated through cudaHostAlloc. But we cannot afford any other memory access to that page-locked memory, because it is uncached and performs poorly. How can we benefit from the asynchronous behavior of cudaMemcpyAsync [d->h] while still accessing the host memory efficiently?
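For context, a minimal sketch of the pattern in question (buffer names, sizes, and error handling are illustrative, not from the original thread):

```cuda
#include <cuda_runtime.h>

int main(void) {
    const size_t N = 1 << 20;
    float *hostBuf = NULL, *devBuf = NULL;
    cudaStream_t stream;

    // Page-locked host memory: required for cudaMemcpyAsync to overlap
    // with CPU work, but uncached on TX1, so direct CPU reads/writes
    // of hostBuf are slow.
    cudaHostAlloc(&hostBuf, N * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&devBuf, N * sizeof(float));
    cudaStreamCreate(&stream);

    // Asynchronous device-to-host copy; control returns to the CPU
    // immediately, allowing overlap.
    cudaMemcpyAsync(hostBuf, devBuf, N * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);

    // ... CPU work overlapping with the copy ...

    cudaStreamSynchronize(stream);
    // Any further CPU processing of hostBuf now hits uncached memory,
    // which is the performance problem described above.

    cudaStreamDestroy(stream);
    cudaFree(devBuf);
    cudaFreeHost(hostBuf);
    return 0;
}
```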


Hi rownine,

A better allocator to use is cudaMallocManaged. It returns memory that can be used on both the CPU and the GPU.
The memory is cached on the CPU, and no explicit migration to the GPU with memcpy is needed.

See the Unified Memory Programming Guide section in the CUDA Programming Guide.
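A minimal sketch of the suggested approach (the kernel, sizes, and launch configuration are illustrative):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *data, float factor, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main(void) {
    const size_t N = 1 << 20;
    float *data = NULL;

    // Managed memory: a single pointer valid on both CPU and GPU.
    // On TX1 the CPU side is cached, unlike cudaHostAlloc memory.
    cudaMallocManaged(&data, N * sizeof(float));

    for (size_t i = 0; i < N; ++i)
        data[i] = 1.0f;                       // cached CPU access

    scale<<<(N + 255) / 256, 256>>>(data, 2.0f, N);
    cudaDeviceSynchronize();                  // required before the CPU
                                              // touches the data again

    printf("data[0] = %f\n", data[0]);        // cached CPU access, no memcpy
    cudaFree(data);
    return 0;
}
```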


Thank you! One more question about managed memory:

Is there an implicit memcpy between the CPU and GPU, given that TX1 has an integrated GPU? We ran the same program on two TX1 boards: one spent extra time accessing the managed memory on the CPU after the kernel finished, while the other had nearly no extra cost. This result confused me.

Hi rownine,

When using managed memory, the CPU<->GPU transfers are made using the appropriate cache operations.
There is no memcpy involved. When using managed memory, the kernel execution time is slightly higher because
the driver needs to perform the memory transfers using cache-ops. This transfer/cache-op time shows up in the kernel execution time.

Hint: Your program will run more optimally if you attach only the (managed) buffers you actually use on a stream with cudaStreamAttachMemAsync. The driver will then know that only this memory is being used on the stream and will
transfer only that.
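The hint above can be sketched as follows (the kernel, sizes, and stream setup are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void process(float *buf, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

int main(void) {
    const size_t N = 1 << 20;
    float *buf = NULL;
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    cudaMallocManaged(&buf, N * sizeof(float));

    // Attach this managed buffer to a single stream; length 0 means the
    // whole allocation. The driver then only performs cache maintenance
    // for this buffer when work is launched on 'stream'.
    cudaStreamAttachMemAsync(stream, buf, 0, cudaMemAttachSingle);
    cudaStreamSynchronize(stream);  // attachment takes effect after sync

    process<<<(N + 255) / 256, 256, 0, stream>>>(buf, N);
    cudaStreamSynchronize(stream);  // buf is now safe to read on the CPU

    cudaFree(buf);
    cudaStreamDestroy(stream);
    return 0;
}
```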