Managed memory vs cudaHostAlloc - TK1

I’m working on the TK1 and have encountered a problem with performance.

If I use cudaMallocManaged for a large array, say 200 MB, and then access much of the array in a kernel I’ve made, performance is fast.

However, if I make the array 400 MB, my program is slower, even though I’m still doing the same amount of work. It seems as if I’m being penalized for having a larger array.

My best guess is that each cudaDeviceSynchronize is “copying” the entire array from device to host, as pointed out in bullet 5 of this thread:

https://devtalk.nvidia.com/default/topic/695408/?comment=4134257

I don’t understand why this copy takes place on a TK1 anyway, since its memory is physically unified.

When I allocate the array using cudaHostAlloc instead of cudaMallocManaged, performance is tremendously worse. The one function that takes much longer uses some atomicAdd calls, but I don’t understand why managed memory would behave any differently from cudaHostAlloc on a device with unified memory.

And if I have to use managed memory, is there a way to increase buffer size without suffering due to the entire array being copied at each synchronize stage?

On the TK1 you can achieve fast access to pinned memory if you make sure you call “cudaSetDeviceFlags(cudaDeviceMapHost)” before calling “cudaSetDevice(0)”.
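A minimal sketch of that ordering (the single-GPU board is assumed to be device 0, and the 200 MB buffer size is just illustrative):

```cuda
#include <cuda_runtime.h>

int main() {
    // Must be called before the CUDA context is created, i.e. before
    // cudaSetDevice() or any runtime call that touches the device.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaSetDevice(0);

    // Pinned host memory, mapped into the device address space.
    float *h_buf = nullptr;
    cudaHostAlloc((void **)&h_buf, 200u * 1024 * 1024, cudaHostAllocMapped);

    // Device-side pointer to the same physical memory.
    float *d_buf = nullptr;
    cudaHostGetDevicePointer((void **)&d_buf, h_buf, 0);

    // ... launch kernels that read/write d_buf ...

    cudaFreeHost(h_buf);
    return 0;
}
```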

With a modified bandwidthTest, this raises the D>H and H>D copy performance to 6.25 GB/sec each way (12.5 GB/sec combined).

The reported throughput is identical whether it’s copying 32MB or 400MB.

I wonder if this flag would help Managed Memory performance? I’ve been sticking to pinned/mapped/write-combined memory for now but am very interested in what kind of performance (and pitfalls) other people are seeing with the Managed Memory APIs.

As you note, @seibert’s write-up revealed some important things to be aware of when using the MM API on discrete GPUs. Updating that knowledge for the TK1 would be useful.

Perhaps someone from NVIDIA can crank out all these combinations of memory types and their performance on discrete and unified architectures?

I don’t believe I call “cudaSetDevice()” anywhere.

My two attempts have been:

  1. Create large array buffers using cudaMallocManaged.
  2. Create large array buffers using cudaHostAlloc.
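The two attempts could be sketched as follows (the buffer size is a placeholder and error checking is omitted):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 200u * 1024 * 1024;  // illustrative size

    // Attempt 1: managed allocation - one pointer, valid on host and device.
    float *managed = nullptr;
    cudaMallocManaged((void **)&managed, bytes);

    // Attempt 2: pinned, mapped host allocation - the kernel accesses the
    // host buffer through a separate device pointer.
    float *host = nullptr, *dev = nullptr;
    cudaHostAlloc((void **)&host, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&dev, host, 0);

    cudaFree(managed);
    cudaFreeHost(host);
    return 0;
}
```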

For one CUDA function in particular, attempt 2 made things far slower, possibly due to a large number of atomic adds.

Hence I wish to stick with cudaMallocManaged, since none of my other attempts (including the pinned/mapped/write-combined flags) have yielded any improvement.

It is just unfortunate that I can’t control whether it syncs the entire contents of the managed memory, since I only use the entire array at the start of each loop.
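One possibility, not confirmed in this thread: CUDA 6 lets you attach a managed allocation to a specific stream with cudaStreamAttachMemAsync, so that only synchronizing that stream affects the allocation’s coherence, rather than every device-wide synchronize sweeping the whole managed pool. A minimal sketch:

```cuda
#include <cuda_runtime.h>

__global__ void touch(float *data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    cudaStream_t workStream;
    cudaStreamCreate(&workStream);

    const size_t N = 100u * 1024 * 1024;  // 400 MB of floats
    float *buf = nullptr;
    cudaMallocManaged((void **)&buf, N * sizeof(float));

    // Scope the allocation to workStream: only kernels launched into
    // workStream may touch it, and only synchronizing workStream
    // affects its coherence.
    cudaStreamAttachMemAsync(workStream, buf);
    cudaStreamSynchronize(workStream);

    touch<<<(N + 255) / 256, 256, 0, workStream>>>(buf, N);
    cudaStreamSynchronize(workStream);

    cudaFree(buf);
    return 0;
}
```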

If no call to cudaSetDevice() is made, the current device is implicitly device 0.

You should probably try enabling that flag – at least when using cudaHostAlloc().

I tried enabling the flag, but it seems to have had no impact on the runtime. From the documentation, any call to cudaHostGetDevicePointer() should return an error if the flag hasn’t been set, yet I’ve called it in the past without setting the flag and didn’t get any errors.

I don’t think it’s an issue of the device not being able to access the memory, but rather that the memory is somehow much slower to access than when it is managed.

If I increase the buffer size, the functions in which a cudaDeviceSynchronize() is called become a few ms slower (by the same amount).

The increase in time aligns perfectly with the device-to-host and host-to-device bandwidth measured by the bandwidthTest utility. So for now I suppose I will live with the slight increase in transfer time when I need to allocate a larger buffer.

Do you have a small test that demonstrates this behavior?

Also, how long is the delay that you’re seeing? Milliseconds? Microseconds?

It would be useful to see what “nvprof” reports for different array sizes.
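For example, something along these lines (the binary name “mm_test” and its size argument are hypothetical placeholders for whatever test program demonstrates the slowdown):

```shell
# Profile the same binary with two buffer sizes and compare the
# per-call times and any unified-memory activity in the output.
nvprof ./mm_test 200
nvprof ./mm_test 400
```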

Hey Allan, I am also running tests on a TK1. However, I didn’t get the same results you did.
Running on…

Device 0: GK20A
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 996.2

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3692.3

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3997.6

Result = PASS

I’m wondering what your configuration is.

Thanks,

Hi,
I am also writing my application on a Jetson TK1. I also noticed that using more cudaMallocManaged memory kills performance, and I am not making any cudaDeviceSynchronize calls in my code. Can someone enlighten me as to the reason for this behaviour? Is it the same on discrete GPUs as well?

Thanks
sivaramakrishna

I have no experience with these GPUs, but desktop and smartphone graphics usually reserve a memory area for exclusive GPU use. Yes, although the memory is physically the same, the GPU has its own area for its work. So when you go beyond that area’s size, it probably starts swapping between the CPU-dedicated and GPU-dedicated parts of memory, implementing a sort of virtual memory (we have already discussed that discrete GPUs do this with WDDM drivers).

Maybe it can be fixed by allocating more GPU-dedicated memory via the driver or similar.

Oh, I see that it was already described here: https://devtalk.nvidia.com/default/topic/754874/jetson-tk1/tk1-memory-bandwidth/post/4246020/#4246020

There seems to be a talk at GTC 2016 about this topic.
S6435 - Analyzing the Behavior of Memory Accesses on the NVIDIA Jetson TK1
http://mygtc.gputechconf.com/quicklink/hZqfrQ0