What exactly does the managed memory flag do and what changes?

I am a Jetson AGX Xavier user, and this device has no problem using unified memory, right?
I have a question about UM (unified memory), so let me explain the situation first.
For example, for object-detection inference, I allocate 10 MB of UM to hold the parameters, with the default flag (cudaMemAttachGlobal). If there are 10 layers of inference, each using 1 MB of parameters, 10 MB should be sufficient, right?

The code would work roughly like this.

cudaMallocManaged(&buf, 10 * MB, cudaMemAttachGlobal); // MB = 1 MB
size_t offset = 0;
for (int i = 0; i < 10; i++) {
  layer *l = &network.layers[i];
  fread(buf + offset, 1, MB, fp); // Read 1 MB of parameters from disk
  l->buf_gpu = buf + offset;
  kernel<<< ... >>>(l->buf_gpu);  // The 1 MB at buf + offset is used by the GPU
  offset += MB;                   // Advance the offset by 1 MB
}

This code launches the kernel 10 times, advancing offset through buf each time.

What I do not understand is that the above code does not work when the cudaMemAttachGlobal flag is used, but works well when the cudaMemAttachHost flag is used.

I read the documentation on managed memory: for cudaMemAttachGlobal it says the memory is always accessible from both CPU and GPU, while for cudaMemAttachHost it says access is only conditionally possible. Honestly, this part of the docs could use a more detailed explanation. (CUDA Runtime API :: CUDA Toolkit Documentation)

So I thought it would work with AttachGlobal, but it didn't. Why?
To execute the above code with buf allocated as AttachGlobal, I have to insert cudaStreamSynchronize() after the kernel call. But AttachHost needs no synchronize() and works just fine.
I know that the CPU can access the buffer even if the buffer is being used by the GPU.
Am I wrong?

Of course, my actual code is more complex, so the problem may be caused by other parts. But I’m asking because I think this problem is caused by memory access.

I’ll wait for your reply.
thank you!

have you tried a cudaDeviceSynchronize() following the kernel call?

Quoting Robert Crovella, from another thread

Yes.
I’ve tried using cudaStreamSynchronize() or cudaDeviceSynchronize() after the kernel call and it works fine.

But adding a synchronize() after the kernel call is not what I want, because I want to read the next layer's parameters into the buffer while the kernel is still running on the GPU. As you can see from the code, it was never designed to have the CPU and GPU access the same memory address at the same time.
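For reference, the kind of overlap I'm after looks roughly like the pattern below. This is a rough, untested sketch assuming the stream-attachment approach; kernel, grid, block, and fp are placeholders from my real code. The idea is to allocate with cudaMemAttachHost, attach a buffer to a stream only while the GPU uses it, and alternate between two buffers so the CPU never touches memory a running kernel owns:

```cuda
const size_t MB = 1 << 20; // 1 MiB chunks, so chunks stay page-aligned
char *buf[2];
cudaStream_t s[2];
cudaStreamCreate(&s[0]);
cudaStreamCreate(&s[1]);
// cudaMemAttachHost: the CPU may access these while unrelated GPU work runs
cudaMallocManaged(&buf[0], MB, cudaMemAttachHost);
cudaMallocManaged(&buf[1], MB, cudaMemAttachHost);

for (int i = 0; i < 10; i++) {
  int b = i % 2;
  cudaStreamSynchronize(s[b]);   // the last kernel using buf[b] is done
  fread(buf[b], 1, MB, fp);      // CPU fills buf[b]; s[1-b] may still be running
  // Hand the buffer to this stream for the duration of the kernel ...
  cudaStreamAttachMemAsync(s[b], buf[b], 0, cudaMemAttachSingle);
  kernel<<<grid, block, 0, s[b]>>>(buf[b]);
  // ... and return it to the host once the kernel completes (stream-ordered)
  cudaStreamAttachMemAsync(s[b], buf[b], 0, cudaMemAttachHost);
}
cudaStreamSynchronize(s[0]);
cudaStreamSynchronize(s[1]);
```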

Also, the passage you quoted says Jetson managed memory cannot be accessed concurrently, so why is concurrent access possible when the cudaMemAttachHost flag is used? Shouldn't that be the case that is disallowed?

But it's going to write to exactly the same memory PAGE, unless your input buffer happens to be aligned so that each chunk falls into different pages. (I am assuming the 10MB you gave in your source code is 10 × 10^6 bytes, not 10 MiB = 10,485,760 bytes.)

I'm not sure what the exact page size is; some sources say it's 64 KiB for unified memory, others say 4 KiB.

What does the cudaMemAttachGlobal flag of cudaMallocManaged() mean, and how is it different from cudaMemAttachHost? Why can the above code run asynchronously, as I want, only with cudaMemAttachHost?

The documentation says.

If cudaMemAttachHost is specified, then the allocation should not be accessed from devices that have a zero value for the device attribute cudaDevAttrConcurrentManagedAccess

I am pretty sure that Jetson devices don't report a cudaDevAttrConcurrentManagedAccess value of 1.
This can be checked with such code:

#include <iostream>
#include <cuda_runtime.h>

int attr = 0;
cudaDeviceGetAttribute(&attr, cudaDevAttrConcurrentManagedAccess, 0); // device 0
std::cout << attr << std::endl; // 0 means no concurrent managed access

What I do not understand is why cudaMemAttachGlobal works and what its implications are for performance.