uncached memory created by cudaHostAlloc and cudaMemcpyAsync issues on TX1

rownine · July 8, 2016, 3:16am

It has been confirmed that memory allocated by the cudaHostAlloc API is uncached, which makes CPU access to it very slow.
[ref] https://devtalk.nvidia.com/default/topic/922626/?comment=4834970

And the hostPtr argument of cudaMemcpyAsync [d->h] must point to page-locked host memory for any overlap to occur.

So the issue is:

To benefit from the asynchronous behavior of cudaMemcpyAsync [d->h], the host memory must be page-locked, i.e. allocated through cudaHostAlloc. But we cannot efficiently do any other access to that page-locked memory, because it is uncached and performs poorly. How can we benefit from the asynchronous behavior of cudaMemcpyAsync [d->h] and still access the host memory efficiently?
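For reference, the pattern in question looks roughly like the sketch below. The kernel `fill`, the buffer names, and the sizes are made up for illustration; only the CUDA runtime calls are real.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t e = (call);                                  \
        if (e != cudaSuccess) {                                  \
            fprintf(stderr, "%s\n", cudaGetErrorString(e));      \
            exit(1);                                             \
        }                                                        \
    } while (0)

// Hypothetical kernel: writes each element's index into the buffer.
__global__ void fill(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

int main() {
    const int n = 1 << 20;
    int *d_buf = nullptr, *h_pinned = nullptr;
    cudaStream_t stream;

    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaMalloc(&d_buf, n * sizeof(int)));
    // Page-locked host memory (reported above to be uncached for the CPU).
    CHECK(cudaHostAlloc((void **)&h_pinned, n * sizeof(int), cudaHostAllocDefault));

    fill<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    // Queued on the stream; the call returns without waiting for the copy.
    CHECK(cudaMemcpyAsync(h_pinned, d_buf, n * sizeof(int),
                          cudaMemcpyDeviceToHost, stream));

    // ... independent CPU work could overlap with the copy here ...

    CHECK(cudaStreamSynchronize(stream));

    // CPU reads of the pinned buffer: this is the slow access being asked about.
    long long sum = 0;
    for (int i = 0; i < n; ++i) sum += h_pinned[i];
    printf("sum = %lld\n", sum);

    CHECK(cudaFreeHost(h_pinned));
    CHECK(cudaFree(d_buf));
    CHECK(cudaStreamDestroy(stream));
    return 0;
}
```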

Thanks!

kayccc · July 12, 2016, 8:44am

Hi rownine,

A better allocator to use is cudaMallocManaged. It returns memory that can be used on both the CPU and the GPU.
The memory is cached on the CPU, and no explicit migration to the GPU with memcpy is needed.

See the Unified Memory Programming section of the CUDA Programming Guide:
Programming Guide :: CUDA Toolkit Documentation
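A minimal sketch of that suggestion, assuming a made-up kernel `scale` and an arbitrary buffer size. One pointer from cudaMallocManaged is used on both sides, with no cudaMemcpy. On devices like TX1 the CPU should not touch managed memory while a kernel is still running, so the sketch synchronizes before reading the result.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies every element by k.
__global__ void scale(float *data, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= k;
}

int main() {
    const int n = 1 << 20;
    float *data = nullptr;

    // One allocation, visible to both CPU and GPU.
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMallocManaged failed\n");
        return 1;
    }

    // CPU writes directly through the managed pointer.
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    // GPU uses the same pointer; no explicit copy to the device.
    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);

    // Wait for the kernel before the CPU touches the buffer again.
    cudaDeviceSynchronize();

    // CPU reads the result directly.
    printf("data[0] = %f\n", data[0]);

    cudaFree(data);
    return 0;
}
```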

Thanks

rownine · July 13, 2016, 3:15am

Thank you! One more question about managed memory:

Is there an implicit memcpy between the CPU and GPU, given that the TX1 has an integrated GPU? We ran the same program on two TX1 boards: on one, accessing managed memory from the CPU after the kernel finished took extra time, while on the other there was almost no extra cost. This result confuses me.

kayccc · July 15, 2016, 5:30am

Hi rownine,

When using managed memory, the CPU<->GPU transfers are made using appropriate cache operations.
There is no memcpy involved. With managed memory, the kernel execution time is slightly higher because
the driver needs to perform the memory transfers using cache ops. This transfer/cache-op time shows up in the kernel
execution time.

Hint: Your program will run more efficiently if you attach only the (managed) buffers that you actually use on a stream to that stream with cudaStreamAttachMemAsync. The driver will then know that only this memory is being used on the stream and
will transfer only that.
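A minimal sketch of the hint, assuming a made-up kernel `increment` and a single managed buffer. The buffer is attached to one stream with cudaStreamAttachMemAsync and the cudaMemAttachSingle flag, and a length of 0 covers the whole allocation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1 to every element.
__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1 << 20;
    int *data = nullptr;
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    cudaMallocManaged(&data, n * sizeof(int));

    // Tie the managed buffer to this stream only (length 0 = entire allocation).
    cudaStreamAttachMemAsync(stream, data, 0, cudaMemAttachSingle);
    // The attach is asynchronous; wait for it before the CPU uses the buffer.
    cudaStreamSynchronize(stream);

    for (int i = 0; i < n; ++i) data[i] = i;

    // Launch on the stream the buffer is attached to.
    increment<<<(n + 255) / 256, 256, 0, stream>>>(data, n);

    // Only this stream has to be idle before the CPU reads the buffer.
    cudaStreamSynchronize(stream);
    printf("data[0] = %d\n", data[0]);

    cudaFree(data);
    cudaStreamDestroy(stream);
    return 0;
}
```

Buffers that are not attached keep the default global visibility, so the driver has to treat them as potentially used by any kernel.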

Thanks