Using CUDA Unified Memory on an embedded board (physical unified memory)

Hi All,

Apologies if this has been asked before; I could not find an existing topic.

I have a Jetson TK1 board, which has physical unified memory shared by the CPU and GPU. I have two questions about using CUDA Unified Memory on this board.

  1. If I want the GPU and CPU to write different parts of the same large data array at the same time (parallel CPU and GPU execution), is that possible? Does it make sense at all? Or is CUDA Unified Memory only meant for a producer-consumer model? Of course, I am responsible for avoiding data races.
  2. When I compare zero-copy and Unified Memory, is Unified Memory always better? If so, is that because CUDA Unified Memory enables the CPU cache on Jetson?

Thank you very much in advance

Concurrent CPU and GPU access to the same allocation currently does not fit the UM execution model:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-gpu-exclusive

“In general, it is not permitted for the CPU to access any managed allocations or variables while the GPU is active.”

I think some folks have found zero-copy to be the faster approach on Jetson where the memory is physically unified. You might need to benchmark the differences for your code and test case. There are no coherency guarantees when using zero-copy. There is effectively a coherency (CPU/GPU) guarantee with UM, see above.

In my experience, using zero-copy buffers with user-managed coherency (cudaStreamSynchronize(…), etc.) is the most efficient approach on TK1 when you need really “tight” CPU/GPU concurrency.

I don’t think there is a real performance benefit with UM; the benefit is rather ease of use.
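For anyone reading along, here is a minimal sketch of the zero-copy approach with user-managed coherency described above, where the GPU and CPU write disjoint halves of one mapped buffer at the same time. The names `fill_gpu_half` and `run_zero_copy_demo`, the array size, and the fill values are all made up for illustration; this is one way it could look, not a tuned implementation, and it says nothing about performance on TK1.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: the GPU writes only the first half of the buffer.
__global__ void fill_gpu_half(int *data, int half) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < half) data[i] = 1;
}

// Hypothetical helper: returns 0 if both halves hold the expected values.
int run_zero_copy_demo() {
    const int n = 1 << 20;
    const int half = n / 2;
    int *host = NULL;  // CPU view of the mapped buffer
    int *dev = NULL;   // GPU view of the same buffer
    cudaStream_t stream;

    // Request mapped host memory support; older CUDA versions need this
    // before the context is created. The return value is ignored here.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    if (cudaHostAlloc((void **)&host, n * sizeof(int), cudaHostAllocMapped) != cudaSuccess) return 1;
    if (cudaHostGetDevicePointer((void **)&dev, host, 0) != cudaSuccess) return 2;
    if (cudaStreamCreate(&stream) != cudaSuccess) return 3;

    // The launch is asynchronous, so the CPU loop below overlaps with the kernel.
    fill_gpu_half<<<(half + 255) / 256, 256, 0, stream>>>(dev, half);

    // The CPU writes only the second half, so the two writers never overlap.
    for (int i = half; i < n; ++i) host[i] = 2;

    // User-managed coherency: do not read the GPU half before this returns.
    if (cudaStreamSynchronize(stream) != cudaSuccess) return 4;

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        int expected = (i < half) ? 1 : 2;
        if (host[i] != expected) ++bad;
    }

    cudaStreamDestroy(stream);
    cudaFreeHost(host);
    return bad ? 5 : 0;
}

int main() {
    int rc = run_zero_copy_demo();
    printf("zero-copy demo %s\n", rc == 0 ? "passed" : "failed");
    return rc;
}
```

The point of the sketch is that nothing in the runtime stops the CPU from touching the buffer while the kernel runs; keeping the two halves disjoint and synchronizing before reading the GPU half is entirely up to you.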

Hi All,

Thank you very much for your answers and for sharing your experience. They are quite useful.

So when you say “efficient coherency”, @txbob, what did you mean? Can Unified Memory do something better than user-managed coherency, such as less data usage or more cache usage?

@Jimmy Zero-copy is cool, but to manage coherency I need to duplicate data and migrate it somehow. Unified Memory does that automatically, which is why I am interested in it.

I don’t think I said “efficient coherency” anywhere in my posting.

Uh, sorry :) I was writing too fast. “Effectively coherency” is what I meant to say.

I said:

There is CPU/GPU “coherency” because of the exclusive access provision of UM that I already pointed out and gave the document link for. Did you read it?

What it says is that when the GPU has access to a UM memory space, the CPU does not. And when the CPU has access to a UM memory space, the GPU does not. This is the nature of the current (pre-CUDA 8/P100) implementation of UM, which is the implementation relevant for Jetson TK1/TX1, the subject of this thread.

Therefore, when the CPU is accessing the memory space, all previous GPU traffic to that space is guaranteed to be complete and coherent. Likewise, when the GPU is accessing the memory space, all previous CPU traffic to that space is guaranteed to be complete and coherent.
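As a rough illustration of the exclusive-access pattern described above, the sketch below hands a managed allocation back and forth between CPU and GPU. The names `add_one` and `run_managed_demo` and the array size are hypothetical, and it assumes the pre-CUDA 8/P100 UM behavior discussed in this thread, where the CPU must stay away from managed data until the GPU is idle.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: increments every element once.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical helper: returns 0 if the CPU sees the GPU's results.
int run_managed_demo() {
    const int n = 1 << 20;
    int *data = NULL;
    if (cudaMallocManaged(&data, n * sizeof(int)) != cudaSuccess) return 1;

    // No kernel is in flight, so the CPU may access the allocation.
    for (int i = 0; i < n; ++i) data[i] = i;

    // From the launch onward the GPU is active; under the execution model
    // assumed here, the CPU must not touch data until synchronization.
    add_one<<<(n + 255) / 256, 256>>>(data, n);

    // After this returns, all GPU traffic to data is complete and the
    // CPU may access the allocation again.
    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(data);
        return 2;
    }

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        if (data[i] != i + 1) ++bad;
    }

    cudaFree(data);
    return bad ? 3 : 0;
}

int main() {
    int rc = run_managed_demo();
    printf("managed memory demo %s\n", rc == 0 ? "passed" : "failed");
    return rc;
}
```

Note that there is no explicit copy or flush anywhere: the synchronization point is what marks the handover, which is the "ease of use" side of the trade-off mentioned earlier in the thread.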