Managed Memory is slower on host than memory allocated with new[]

Good afternoon,

I’m trying out Managed Memory on an AGX Orin, hoping to cut out the transfer times. I’m using cudaStreamAttachMemAsync() to attach Managed Memory buffers to a stream when running a kernel, and the same call to attach a buffer back to the host when I want to check the kernel’s results. When I use Managed Memory, checking the results takes 80 ms. However, if I switch to standard memory (allocated via new[] on the host and cudaMalloc on the device), checking the results takes only 40 ms. The data I’m running with is 15,000,000 floats.

Basically, Managed Memory without copies saved me 30 ms (40 ms with copies, 10 ms without), but it cost me an extra 40 ms (80 ms with Managed Memory, 40 ms with normal memory) when it came time to use the results. Is this expected?
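For reference, here is a minimal sketch of the attach pattern I described, with illustrative names (myKernel, blocks, threads) and error checking omitted:

```cpp
const size_t N = 15'000'000;
float* data = nullptr;
cudaMallocManaged(&data, N * sizeof(float));

cudaStream_t stream;
cudaStreamCreate(&stream);

// Attach the managed buffer to the stream before launching the kernel...
cudaStreamAttachMemAsync(stream, data, 0, cudaMemAttachSingle);
myKernel<<<blocks, threads, 0, stream>>>(data, N);

// ...then attach it back to the host before reading the results on the CPU.
cudaStreamAttachMemAsync(stream, data, 0, cudaMemAttachHost);
cudaStreamSynchronize(stream);

// The CPU can now read data[] directly, with no explicit cudaMemcpy.
```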

CUDA is version 11.4.4.



Just to confirm: have you applied cudaMemcpy to copy the managed-memory data back to the CPU?
This is not required, since managed memory is synchronized by the GPU driver directly.

Also, could you try maximizing the device’s performance to see if it makes any difference?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks


Since this is an AGX Orin, I didn’t think I needed the cudaMemcpy, since the CPU and GPU use the same RAM chips. There’s documentation on using Managed Memory on embedded hardware to skip the memory transfers entirely. (I don’t mean have them done automatically; I mean skip them.)

I’m not going to change the clocks: we’ve had issues with units overheating when we do that. I don’t see why that would change anything though: the CPU side is running at the same clock speed regardless of whether the memory was allocated by new[] or cudaMallocManaged, right?

Bah. I missed the line in the Tegra notes document saying that Unified Memory has an overhead penalty. That’s probably what I’m seeing. See CUDA for Tegra, section 4.1, Memory Selection. I may try pinned memory at a later date.
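For anyone curious, the pinned-memory alternative I’d try would look roughly like this (untested on my end; myKernel, blocks, and threads are illustrative):

```cpp
const size_t N = 15'000'000;

// Page-locked, mapped host memory: on Tegra the GPU can access it directly,
// avoiding both explicit copies and the Unified Memory coherence overhead.
float* hostPtr = nullptr;
cudaHostAlloc(&hostPtr, N * sizeof(float), cudaHostAllocMapped);

// Get a device-side pointer to the same allocation.
float* devPtr = nullptr;
cudaHostGetDevicePointer(&devPtr, hostPtr, 0);

myKernel<<<blocks, threads>>>(devPtr, N);
cudaDeviceSynchronize();

// Results are readable through hostPtr with no cudaMemcpy. Note the CUDA for
// Tegra document also covers the caching behavior of pinned memory, which
// differs between Tegra generations and can affect access performance.
```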


Is your issue fixed after switching to pinned memory?

Haven’t had time to try, and probably won’t for a couple weeks. Just went back to our original code, which uses cudaMalloc and std::vector.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.