Managed memory vs cudaHostAlloc - TK1

I’m working on the TK1 and have encountered a problem with performance.

If I use cudaMallocManaged for a large array, say 200 MB, and then access much of the array in a kernel I’ve made, performance is fast.

However, if I make the array 400 MB, my program is slower, even though I’m still doing the same amount of work. It seems as if I’m being penalized for having a larger array.

My best guess is that each cudaDeviceSynchronize is “copying” the entire array from device to host, as pointed out in bullet 5 of this thread:

https://devtalk.nvidia.com/default/topic/695408/?comment=4134257

I don’t understand why this copy takes place on a TK1 anyway, since its memory is physically unified.

When I allocate the array using cudaHostAlloc instead of cudaMallocManaged, performance is tremendously worse. The one function that takes much longer uses some atomicAdd calls, but I don’t understand why managed memory would behave any differently from cudaHostAlloc on a device with unified memory.

And if I have to use managed memory, is there a way to increase buffer size without suffering due to the entire array being copied at each synchronize stage?

On the TK1 you can achieve fast access to pinned memory if you make sure you call “cudaSetDeviceFlags(cudaDeviceMapHost)” before calling “cudaSetDevice(0)”.
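A minimal sketch of that ordering (the single-GPU board is assumed to be device 0, and the 200 MB buffer size is just illustrative):

```cuda
#include <cuda_runtime.h>

int main() {
    // Must be called before the CUDA context is created, i.e. before
    // cudaSetDevice() or any runtime call that touches the device.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaSetDevice(0);

    // Pinned host memory, mapped into the device address space.
    float *h_buf = nullptr;
    cudaHostAlloc((void **)&h_buf, 200u * 1024 * 1024, cudaHostAllocMapped);

    // Device-side pointer to the same physical memory.
    float *d_buf = nullptr;
    cudaHostGetDevicePointer((void **)&d_buf, h_buf, 0);

    // ... launch kernels that read/write d_buf ...

    cudaFreeHost(h_buf);
    return 0;
}
```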

With a modified bandwidthTest, this raises the D>H and H>D copy performance to 6.25 GB/sec each way (12.5 GB/sec combined).

The reported throughput is identical whether it’s copying 32MB or 400MB.

I wonder if this flag would help Managed Memory performance? I’ve been sticking to pinned/mapped/write-combined memory for now but am very interested in what kind of performance (and pitfalls) other people are seeing with the Managed Memory APIs.

As you note, @seibert’s write-up revealed some important things to be aware of when using the MM API on discrete GPUs. Updating that knowledge for the TK1 would be useful.

Perhaps someone from NVIDIA can crank out all these combinations of memory types and their performance on discrete and unified architectures?

I don’t believe I call “cudaSetDevice()” anywhere.

My two attempts have been:

  1. Create large array buffers using cudaMallocManaged.
  2. Create large array buffers using cudaHostAlloc.
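The two attempts could be sketched as follows (the buffer size is a placeholder and error checking is omitted):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 200u * 1024 * 1024;  // illustrative size

    // Attempt 1: managed allocation - one pointer, valid on host and device.
    float *managed = nullptr;
    cudaMallocManaged((void **)&managed, bytes);

    // Attempt 2: pinned, mapped host allocation - the kernel accesses the
    // host buffer through a separate device pointer.
    float *host = nullptr, *dev = nullptr;
    cudaHostAlloc((void **)&host, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&dev, host, 0);

    cudaFree(managed);
    cudaFreeHost(host);
    return 0;
}
```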

For one CUDA function in particular, attempt 2 made things far slower, possibly due to a large number of atomic adds.

Hence I wish to stick with cudaMallocManaged, since none of my other attempts (including the pinned/mapped/write-combined flags) have yielded any improvement.

It is just unfortunate that I can’t control whether it syncs the entire contents of the managed memory, since I only use the entire array at the start of each loop.
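One possibility, not confirmed in this thread: CUDA 6 lets you attach a managed allocation to a specific stream with cudaStreamAttachMemAsync, so that only synchronizing that stream affects the allocation’s coherence, rather than every device-wide synchronize sweeping the whole managed pool. A minimal sketch:

```cuda
#include <cuda_runtime.h>

__global__ void touch(float *data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    cudaStream_t workStream;
    cudaStreamCreate(&workStream);

    const size_t N = 100u * 1024 * 1024;  // 400 MB of floats
    float *buf = nullptr;
    cudaMallocManaged((void **)&buf, N * sizeof(float));

    // Scope the allocation to workStream: only kernels launched into
    // workStream may touch it, and only synchronizing workStream
    // affects its coherence.
    cudaStreamAttachMemAsync(workStream, buf);
    cudaStreamSynchronize(workStream);

    touch<<<(N + 255) / 256, 256, 0, workStream>>>(buf, N);
    cudaStreamSynchronize(workStream);

    cudaFree(buf);
    return 0;
}
```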

If no call to cudaSetDevice() is made, the current device is implicitly device 0.

You should probably try enabling that flag – at least when using cudaHostAlloc().

I tried enabling the flag, but it seems to have had no impact on the runtime. From the documentation, any call to cudaHostGetDevicePointer() should return an error if the flag hasn’t been set, yet I’ve called it in the past without setting the flag and didn’t get any errors.

I don’t think it’s an issue of the device not being able to access the memory, but rather that the memory is somehow much slower to access than when it is managed.

If I increase the buffer size, the functions in which a cudaDeviceSynchronize() is called become a few ms slower (by the same amount).

The increase in time aligns perfectly with the device-to-host and host-to-device bandwidth measured by the bandwidthTest utility. So for now I suppose I will live with the slight increase in transfer time when I need to allocate a larger buffer.

Do you have a small test that demonstrates this behavior?

Also, how long is the delay that you’re seeing? Milliseconds? Microseconds?

It would be useful to see what “nvprof” reports for different array sizes.
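For example, something along these lines (the binary name “mm_test” and its size argument are hypothetical placeholders for whatever test program demonstrates the slowdown):

```shell
# Profile the same binary with two buffer sizes and compare the
# per-call times and any unified-memory activity in the output.
nvprof ./mm_test 200
nvprof ./mm_test 400
```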

Hey Allan, I am also running tests on a TK1. However, I didn’t get the same results you did.
Running on…

Device 0: GK20A
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 996.2

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3692.3

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 3997.6

Result = PASS

I’m wondering what your configuration is.

Thanks,

Hi,
I am also writing my application on a Jetson TK1. I also noticed that using more cudaMallocManaged memory kills performance, and I am not making any cudaDeviceSynchronize calls in my code. Can someone enlighten me as to the reason for this behaviour? Is it the same on discrete GPUs as well?

Thanks
sivaramakrishna

I have no experience with these GPUs, but desktop and smartphone graphics usually reserve a memory area for exclusive GPU use. Yes, although the memory is physically the same, the GPU has its own area for its work. So when you go beyond that area’s size, it probably starts swapping between the CPU-dedicated and GPU-dedicated parts of memory, implementing a sort of virtual memory (we have already discussed that discrete GPUs do this with WDDM drivers).

Maybe it can be fixed by allocating more GPU-dedicated memory via the driver or similar.

Oh, I see that it was already described here: https://devtalk.nvidia.com/default/topic/754874/jetson-tk1/tk1-memory-bandwidth/post/4246020/#4246020

There seems to be a talk at GTC 2016 about this topic.
S6435 - Analyzing the Behavior of Memory Accesses on the NVIDIA Jetson TK1
http://mygtc.gputechconf.com/quicklink/hZqfrQ0