Unified Memory On TX1

Allen_Z · April 10, 2018, 3:11am

I’m using Unified Memory on the TX1 to get better IO performance. Normally it works fine, and memory operations perform well. But once I request too much memory with cudaMallocManaged, I get errors. When I switched some of the allocations to cudaMalloc, the errors disappeared.

The error is sometimes something like: ‘Unlaunched kernel error’ or ‘segment error’.

1. How much Unified Memory can I get through the cudaMallocManaged API on the Jetson TX1? Are there tools to monitor memory or other CUDA-related resource limits and their usage?
2. How can I get more detailed error information in this case on the TX1? (I tried cuda-gdb, but it does not work because it conflicts with another application.)
3. Do I need cudaDeviceSynchronize() to guarantee cache consistency when using Unified Memory? (I launch the CUDA kernel on a stream, and there is a cudaStreamSynchronize after the kernel.)

AastaLLL · April 10, 2018, 6:27am

Hi,

1. I guess you are hitting a known issue that has already been fixed in JetPack 3.1.
Here are the details and the solution for your reference:
https://devtalk.nvidia.com/default/topic/1013464/jetson-tx2/gpu-out-of-memory-when-the-total-ram-usage-is-2-8g/post/5172688/#5172688

2. Please run your application with cuda-memcheck.

cuda-memcheck [app]

3. Unified memory on Jetson requires exclusive access.
Please remember to call cudaDeviceSynchronize() to make sure the memory is available to the CPU.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-gpu-exclusive
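
A minimal sketch of points 2 and 3, not taken from the original post: the kernel `scale`, the macro `CHECK`, and the function `run_scale` are hypothetical names chosen for illustration. It checks the return code of every CUDA call, which often gives a more specific message than a bare crash, prints the driver's view of free memory with cudaMemGetInfo(), and calls cudaDeviceSynchronize() before the CPU reads the managed buffer.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a CUDA call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Illustrative kernel: multiply every element by a factor.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Fill a managed buffer on the CPU, scale it on the GPU,
// then sum it on the CPU and return the sum.
float run_scale(int n, float factor) {
    // Driver's estimate of free/total memory; treat it as a rough guide only.
    size_t free_bytes = 0, total_bytes = 0;
    CHECK(cudaMemGetInfo(&free_bytes, &total_bytes));
    std::printf("free: %zu bytes, total: %zu bytes\n", free_bytes, total_bytes);

    float *data = nullptr;
    CHECK(cudaMallocManaged(&data, n * sizeof(float)));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // CPU access, no kernel running yet

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(data, n, factor);
    CHECK(cudaGetLastError());       // catches launch errors such as a bad configuration
    CHECK(cudaDeviceSynchronize());  // waits for the kernel and surfaces execution errors

    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += data[i];  // CPU access after the synchronize
    CHECK(cudaFree(data));
    return sum;
}

int main() {
    std::printf("sum = %f\n", run_scale(1 << 20, 2.0f));
    return 0;
}
```

Running the same binary under cuda-memcheck, as suggested in point 2, should additionally report out-of-bounds or misaligned accesses inside the kernel.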

Thanks.

Allen_Z · April 10, 2018, 7:27am

Hi, AastaLLL

Thanks for your answers.

I’ll try to upgrade our system and recheck.

Good tool!

In that case, if I allocate the output memory with cudaMallocHost and don’t touch any memory allocated by cudaMallocManaged, do I still need a cudaDeviceSynchronize()? And how would the IO performance be then?

Thanks.

AastaLLL · April 12, 2018, 2:47am

Hi,

cudaDeviceSynchronize() is required before the CPU touches unified memory after kernel execution.
Coherence is handled automatically by the GPU driver, so users can usually ignore the IO transfer.
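
Whether the CPU may touch managed memory while the GPU is busy is a device property. The sketch below queries it, assuming a CUDA version that provides the `cudaDevAttrConcurrentManagedAccess` attribute; the function name `concurrent_managed_access` is hypothetical. A result of 0 means the exclusive-access rule applies and the CPU should only touch managed memory after synchronizing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns 1 if CPU and GPU may access managed memory concurrently,
// 0 if the GPU needs exclusive access while it is active,
// and -1 if the query itself fails.
int concurrent_managed_access(int device) {
    int value = 0;
    cudaError_t err = cudaDeviceGetAttribute(
        &value, cudaDevAttrConcurrentManagedAccess, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "attribute query failed: %s\n",
                     cudaGetErrorString(err));
        return -1;
    }
    return value;
}

int main() {
    int supported = concurrent_managed_access(0);  // device 0 is an assumption
    std::printf("concurrentManagedAccess = %d\n", supported);
    return 0;
}
```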

Here is a tutorial of unified memory for your reference:
https://devblogs.nvidia.com/unified-memory-cuda-beginners/

Thanks
