Managed memory vs cudaHostAlloc - TK1

I’m working on the TK1 and have encountered a problem with performance.

If I use cudaMallocManaged for a large array, say 200 MB, and then access much of the array in a kernel I’ve made, performance is fast.

However, if I make the array 400 MB, my program is slower, even though I'm still doing the same amount of work. It seems as if I'm being punished for having a larger array.
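The effect can be reproduced with a minimal timing sketch (assumed kernel and sizes, not the poster's actual code): allocate a managed buffer of `total` bytes, have the kernel touch only a fixed `used` subset, and time the kernel plus the synchronize.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Touch only the first `used` bytes, so the work is constant
// regardless of how large the allocation is.
__global__ void touch(unsigned char *buf, size_t used)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < used)
        buf[i] += 1;
}

int main()
{
    const size_t total = 400u << 20;   // try 200 MB vs 400 MB here
    const size_t used  = 100u << 20;   // amount of work held constant

    unsigned char *buf;
    cudaMallocManaged(&buf, total);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    touch<<<(unsigned)((used + 255) / 256), 256>>>(buf, used);
    cudaEventRecord(stop);
    cudaDeviceSynchronize();           // this is where the extra cost shows up

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel+sync: %.2f ms with %zu MB allocated\n",
           ms, total >> 20);

    cudaFree(buf);
    return 0;
}
```

If the time scales with `total` rather than `used`, the cost is tied to the size of the managed allocation, not to the work the kernel does.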

My best guess is that each cudaDeviceSynchronize is “copying” the entire array from device to host, as is pointed out in bullet 5 of this thread:

I don’t understand why this copy takes place on a TK1 anyway, since it has unified memory.

When I allocate the vector using cudaHostAlloc instead of managed memory, performance is tremendously worse. The one function which takes much longer uses some atomicAdd calls, but I don’t understand why managed memory would behave any differently from cudaHostAlloc on a device with unified memory.
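One plausible explanation, sketched below with placeholder names (`kernel`, `grid`, `block`, `bytes`, `n` are assumptions): with cudaHostAlloc the kernel works through a zero-copy mapping, where GPU accesses go to host memory uncached on the GPU side, so each atomicAdd becomes a round trip over the bus. Managed memory instead gives the kernel a GPU-resident copy that is reconciled at synchronize points.

```cuda
// Zero-copy variant (sketch). The device pointer aliases host memory,
// so every access from the kernel -- especially atomics -- goes over
// the bus rather than to GPU-cached memory.
float *host_buf, *dev_view;
cudaHostAlloc(&host_buf, bytes, cudaHostAllocMapped);
cudaHostGetDevicePointer(&dev_view, host_buf, 0);

kernel<<<grid, block>>>(dev_view, n);  // atomicAdd on dev_view is uncached
cudaDeviceSynchronize();

cudaFreeHost(host_buf);
```

That would explain why only the atomicAdd-heavy function slows down dramatically: it is the access pattern that is punished, not the allocation itself.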

And if I have to use managed memory, is there a way to increase buffer size without suffering due to the entire array being copied at each synchronize stage?

It is a memory model presented as a single address space. Copies from one place in memory to another must still take place; as the programmer you just don’t have to do this manually. Either the MMU or a kernel module is doing it for you.

What exactly does “copying to and from one place” mean then?

Is there a way to do this on my own so it doesn’t copy the entire allocated array space?

If I have a loop where I need to do the following operations:

  1. Allocate 250 MB buffer.
  2. Perform operation where I only keep 10% of the data.
  3. Process the remaining 25 MB on CPU.

Calling cudaDeviceSynchronize copies the entire 250 MB array every time, even though I only need the 25 MB copied.

Using managed memory is still faster than zero-copy or plain cudaMalloc for me because I reuse the entire buffer each new iteration. It seems there should be a way to avoid synchronizing the full managed allocation when I have only used a small part of it.
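One mechanism worth trying here (a sketch under assumptions; `filter_kernel`, `process_on_cpu`, `grid`, `block`, and `n` are placeholders) is cudaStreamAttachMemAsync, which associates a managed allocation with a single stream. Coherence for that allocation is then driven by synchronizing that stream, rather than by every global cudaDeviceSynchronize issued elsewhere in the program, which can cut down how much managed memory a given synchronize has to reconcile.

```cuda
// Attach the managed buffer to one stream so only that stream's
// synchronization points force coherence on it.
unsigned char *buf;
cudaStream_t s;
cudaStreamCreate(&s);
cudaMallocManaged(&buf, 250u << 20, cudaMemAttachHost);
cudaStreamAttachMemAsync(s, buf, 0, cudaMemAttachSingle);
cudaStreamSynchronize(s);              // let the attachment take effect

filter_kernel<<<grid, block, 0, s>>>(buf, n);  // keeps ~10% of the data
cudaStreamSynchronize(s);              // sync only this stream

process_on_cpu(buf, n / 10);           // touch just the surviving 25 MB
```

Whether this avoids reconciling the whole 250 MB within a single allocation depends on the driver; it mainly helps when several managed allocations exist and only some belong to the stream being synchronized.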

You would be better off asking this on the CUDA forums. Embedded forums may not be read by CUDA developers.

A unified memory model does not change the fact that the underlying parts of the system are physically separate; it is just a mapping system. Initial programming is simplified, but doing things manually will still give you a performance boost. Earlier CUDA versions required more memory setup, but someone doing that would probably set up exactly what they need, whereas the simpler “newer” versions with unified memory abstract the process. The abstraction is probably not as smart or optimized as code written piece by piece.
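For the 250 MB / 25 MB loop described above, “doing it manually” might look like the following sketch (`filter_kernel`, `process_on_cpu`, `grid`, `block`, and `n` are assumed names): keep the working set in plain device memory and copy back only the portion the CPU actually needs.

```cuda
// Explicit version: full buffer stays on the device; only the
// surviving 25 MB crosses back to the host each iteration.
unsigned char *d_buf;
unsigned char *h_out;
cudaMalloc(&d_buf, 250u << 20);
cudaMallocHost(&h_out, 25u << 20);     // pinned staging buffer for fast DMA

filter_kernel<<<grid, block>>>(d_buf, n);           // keeps ~10% of the data
cudaMemcpy(h_out, d_buf, 25u << 20,
           cudaMemcpyDeviceToHost);                 // copy 25 MB, not 250 MB

process_on_cpu(h_out, 25u << 20);
```

This trades the convenience of a single pointer for precise control over what is transferred at each step, which is exactly the trade-off being described.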

@milliarde, this might help?

I am also using a Jetson TK1 to write my application. I also noticed that using more cudaMallocManaged memory kills system performance. I am not making any cudaDeviceSynchronize calls in my code. Can someone enlighten me as to the reason for this behaviour? Is it the same on discrete GPUs as well?


Hi sivaramakrishna,

This is because cudaMallocManaged memory is page-locked. As more and more pages are page-locked, system performance goes down. The GPU on the TK1 does not support page faults, so all memory accessible to the GPU needs to be page-locked.
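You can see the relevant capabilities in the device properties. A small sketch, with the caveat that `concurrentManagedAccess` was only added to `cudaDeviceProp` in CUDA 8, so on the TK1’s CUDA 6.5 toolkit that particular field will not compile and the page-fault limitation has to be inferred from the compute capability instead:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("device: %s (sm_%d%d)\n", prop.name, prop.major, prop.minor);
    // 0 means the GPU cannot page-fault on managed memory, so every
    // managed page must be pinned (page-locked) up front.
    printf("concurrentManagedAccess: %d\n", prop.concurrentManagedAccess);
    printf("canMapHostMemory:        %d\n", prop.canMapHostMemory);
    return 0;
}
```

On devices where GPU page faulting is supported (Pascal and later with an appropriate OS), managed pages can migrate on demand instead of being pinned, which is why discrete GPUs of that generation do not show the same system-wide slowdown.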