Unified Memory On TX1

I’m using Unified Memory on the TX1 to get better I/O performance. Normally it works fine and memory operations perform well. But once I request too much memory via cudaMallocManaged, I get errors. When I switch some of the allocations to cudaMalloc, the errors disappear.

The errors look something like a kernel launch failure or a segmentation fault.

  1. How much unified memory can I allocate through the cudaMallocManaged API on the Jetson TX1? Is there a tool to monitor memory or CUDA-related resource limits and their usage?

  2. How can I get more detailed error information in such cases on the TX1? (I tried cuda-gdb, but it doesn’t work; it conflicts with another application.)

  3. Do I need cudaDeviceSynchronize() to guarantee cache coherence when using unified memory? (I launch the CUDA kernel on a stream, and there is a cudaStreamSynchronize() after the kernel.)


1. You are probably hitting a known issue that is already fixed in JetPack 3.1.
Here is the detail and solution for your reference:

2. Please run your application with cuda-memcheck:

cuda-memcheck [app]

3. Unified memory on Jetson requires exclusive access.
Please remember to call cudaDeviceSynchronize() to make sure the memory is available to the CPU.
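The managed-memory pattern described above can be sketched as follows. This is a minimal illustration, not the poster’s actual code; the kernel name `scale` and the sizes are hypothetical:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *data = nullptr;

    // One pointer visible to both CPU and GPU -- no explicit cudaMemcpy.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i)
        data[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);

    // On Jetson the GPU has exclusive access to managed memory while
    // work may be in flight; synchronize before the CPU touches it.
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);  // safe to read only after the sync

    cudaFree(data);
    return 0;
}
```

Reading `data` on the CPU before the cudaDeviceSynchronize() call is exactly the kind of access that can fail on Tegra.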


Hi, AastaLLL

Thanks for your answers.

I’ll try to upgrade our system and recheck.

Good tools!

In that case, if I allocate the output memory with cudaMallocHost and don’t touch any memory allocated by cudaMallocManaged, do I still need a cudaDeviceSynchronize()? And what about the I/O performance then?



cudaDeviceSynchronize() is required when touching unified memory from the CPU after kernel execution.
Coherence is handled automatically by the GPU driver, so users can usually ignore the I/O transfer.
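For the cudaMallocHost variant asked about above, a hedged sketch might look like this. The kernel name `produce` is hypothetical; the key point is that only pinned host memory is touched, and waiting on the stream is sufficient before the CPU reads it:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void produce(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = static_cast<float>(i);
}

int main()
{
    const int n = 1 << 20;
    float *out = nullptr;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Pinned host memory: on Tegra this is mapped into the GPU's
    // address space, so the kernel can write it directly.
    cudaMallocHost(&out, n * sizeof(float));

    produce<<<(n + 255) / 256, 256, 0, stream>>>(out, n);

    // No managed memory is accessed here, so synchronizing the stream
    // (rather than the whole device) is enough before the CPU reads.
    cudaStreamSynchronize(stream);

    printf("out[10] = %f\n", out[10]);

    cudaFreeHost(out);
    cudaStreamDestroy(stream);
    return 0;
}
```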

Here is a tutorial on unified memory for your reference: