Managed memory vs cudaHostAlloc - TK1

I’m working on the TK1 and have encountered a problem with performance.

If I use cudaMallocManaged for a large array, say 200 MB, and then access much of the array in a kernel I’ve made, performance is fast.

However, if I make the array 400 MB, my program is slower, even though I'm still doing the same amount of work. It seems as if I'm being punished for having a larger array.
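The effect can be reproduced with a minimal timing sketch (assumed kernel and sizes, not the poster's actual code): allocate a managed buffer of `total` bytes, have the kernel touch only a fixed `used` subset, and time the kernel plus the synchronize.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Touch only the first `used` bytes, so the work is constant
// regardless of how large the allocation is.
__global__ void touch(unsigned char *buf, size_t used)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < used)
        buf[i] += 1;
}

int main()
{
    const size_t total = 400u << 20;   // try 200 MB vs 400 MB here
    const size_t used  = 100u << 20;   // amount of work held constant

    unsigned char *buf;
    cudaMallocManaged(&buf, total);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    touch<<<(unsigned)((used + 255) / 256), 256>>>(buf, used);
    cudaEventRecord(stop);
    cudaDeviceSynchronize();           // this is where the extra cost shows up

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel+sync: %.2f ms with %zu MB allocated\n",
           ms, total >> 20);

    cudaFree(buf);
    return 0;
}
```

If the time scales with `total` rather than `used`, the cost is tied to the size of the managed allocation, not to the work the kernel does.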

My best guess is that each cudaDeviceSynchronize is “copying” the entire array from device to host, as is pointed out in bullet 5 of this thread:

I don’t understand why this copy takes place on a TK1 anyway, since it has unified memory.

When I allocate the vector using cudaHostAlloc instead of managed memory, performance is tremendously worse. The one function which takes much longer uses some atomicAdd calls, but I don’t understand why managed memory would behave any differently from cudaHostAlloc on a device with unified memory.
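One plausible explanation, sketched below with placeholder names (`kernel`, `grid`, `block`, `bytes`, `n` are assumptions): with cudaHostAlloc the kernel works through a zero-copy mapping, where GPU accesses go to host memory uncached on the GPU side, so each atomicAdd becomes a round trip over the bus. Managed memory instead gives the kernel a GPU-resident copy that is reconciled at synchronize points.

```cuda
// Zero-copy variant (sketch). The device pointer aliases host memory,
// so every access from the kernel -- especially atomics -- goes over
// the bus rather than to GPU-cached memory.
float *host_buf, *dev_view;
cudaHostAlloc(&host_buf, bytes, cudaHostAllocMapped);
cudaHostGetDevicePointer(&dev_view, host_buf, 0);

kernel<<<grid, block>>>(dev_view, n);  // atomicAdd on dev_view is uncached
cudaDeviceSynchronize();

cudaFreeHost(host_buf);
```

That would explain why only the atomicAdd-heavy function slows down dramatically: it is the access pattern that is punished, not the allocation itself.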

And if I have to use managed memory, is there a way to increase buffer size without suffering due to the entire array being copied at each synchronize stage?

It is a memory model presented as a single address space. Copies from one place in memory to another must still take place; as the programmer you just don’t have to do this manually. Either the MMU or a kernel module is doing it for you.

What exactly does “copying to and from one place” mean then?

Is there a way to do this on my own so it doesn’t copy the entire allocated array space?

If I have a loop where I need to do the following operations:

  1. Allocate 250 MB buffer.
  2. Perform operation where I only keep 10% of the data.
  3. Process the remaining 25 MB on CPU.

Calling cudaDeviceSynchronize copies the entire 250 MB array every time, even though I only need the 25 MB copied.

Using managed memory is still faster than zero-copy or plain cudaMalloc for me because I reuse the entire buffer each new iteration. It seems there should be a way to avoid synchronizing the full managed allocation when I have only used a small part of it.
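One mechanism worth trying here (a sketch under assumptions; `filter_kernel`, `process_on_cpu`, `grid`, `block`, and `n` are placeholders) is cudaStreamAttachMemAsync, which associates a managed allocation with a single stream. Coherence for that allocation is then driven by synchronizing that stream, rather than by every global cudaDeviceSynchronize issued elsewhere in the program, which can cut down how much managed memory a given synchronize has to reconcile.

```cuda
// Attach the managed buffer to one stream so only that stream's
// synchronization points force coherence on it.
unsigned char *buf;
cudaStream_t s;
cudaStreamCreate(&s);
cudaMallocManaged(&buf, 250u << 20, cudaMemAttachHost);
cudaStreamAttachMemAsync(s, buf, 0, cudaMemAttachSingle);
cudaStreamSynchronize(s);              // let the attachment take effect

filter_kernel<<<grid, block, 0, s>>>(buf, n);  // keeps ~10% of the data
cudaStreamSynchronize(s);              // sync only this stream

process_on_cpu(buf, n / 10);           // touch just the surviving 25 MB
```

Whether this avoids reconciling the whole 250 MB within a single allocation depends on the driver; it mainly helps when several managed allocations exist and only some belong to the stream being synchronized.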

You would be better off asking this on the CUDA forums. Embedded forums may not be read by CUDA developers.

A unified memory model does not change the fact that the underlying parts of the system are physically separate; it is just a mapping system. Initial programming is simplified, but doing things manually will still give you a performance boost. Earlier CUDA versions required more memory setup, but someone doing that would probably set up exactly what they need, whereas the simpler “newer” versions with unified memory abstract the process. The abstraction is probably not as smart or optimized as code written piece by piece.
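For the 250 MB / 25 MB loop described above, “doing it manually” might look like the following sketch (`filter_kernel`, `process_on_cpu`, `grid`, `block`, and `n` are assumed names): keep the working set in plain device memory and copy back only the portion the CPU actually needs.

```cuda
// Explicit version: full buffer stays on the device; only the
// surviving 25 MB crosses back to the host each iteration.
unsigned char *d_buf;
unsigned char *h_out;
cudaMalloc(&d_buf, 250u << 20);
cudaMallocHost(&h_out, 25u << 20);     // pinned staging buffer for fast DMA

filter_kernel<<<grid, block>>>(d_buf, n);           // keeps ~10% of the data
cudaMemcpy(h_out, d_buf, 25u << 20,
           cudaMemcpyDeviceToHost);                 // copy 25 MB, not 250 MB

process_on_cpu(h_out, 25u << 20);
```

This trades the convenience of a single pointer for precise control over what is transferred at each step, which is exactly the trade-off being described.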

@milliarde, this might help?

I am also using a Jetson TK1 to write my application. I also noticed that using more cudaMallocManaged memory kills system performance. I am not making any cudaDeviceSynchronize calls in my code. Can someone enlighten me as to the reason for this behaviour? Is it the same on discrete GPUs as well?


Hi sivaramakrishna,

This is because cudaMallocManaged memory is page-locked. As more and more pages are page-locked, system performance goes down. The GPU on the TK1 does not support page faults, so all memory accessible to the GPU needs to be page-locked.
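You can see the relevant capabilities in the device properties. A small sketch, with the caveat that `concurrentManagedAccess` was only added to `cudaDeviceProp` in CUDA 8, so on the TK1’s CUDA 6.5 toolkit that particular field will not compile and the page-fault limitation has to be inferred from the compute capability instead:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("device: %s (sm_%d%d)\n", prop.name, prop.major, prop.minor);
    // 0 means the GPU cannot page-fault on managed memory, so every
    // managed page must be pinned (page-locked) up front.
    printf("concurrentManagedAccess: %d\n", prop.concurrentManagedAccess);
    printf("canMapHostMemory:        %d\n", prop.canMapHostMemory);
    return 0;
}
```

On devices where GPU page faulting is supported (Pascal and later with an appropriate OS), managed pages can migrate on demand instead of being pinned, which is why discrete GPUs of that generation do not show the same system-wide slowdown.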