What exactly does the managed memory flag do and what changes?

I am a Jetson AGX Xavier user, and this device has no problem using unified memory, right?
I have a question about UM (unified memory), so let me explain the situation first.
For example, for object-detection inference, I allocate 10 MB of UM to hold the parameters, with the default flag (cudaMemAttachGlobal). If there are 10 layers of inference, each using 1 MB of parameters, 10 MB should be sufficient, right?

The code would work roughly like this.

cudaMallocManaged(&buf, 10 * MB, cudaMemAttachGlobal); // MB = 1 MB
size_t offset = 0;
for (int i = 0; i < 10; i++) {
  layer *l = &network.layers[i];
  fread(buf + offset, 1, MB, fp); // Read 1 MB of parameters from disk
  l->buf_gpu = buf + offset;
  kernel<<< ... >>>(l->buf_gpu);  // The 1 MB at buf + offset is used by the GPU
  offset += MB;                   // Advance the offset by 1 MB
}

This code launches the kernel 10 times, advancing offset through buf each time.

What I do not understand is that the above code does not work when the cudaMemAttachGlobal flag is used, but works well when the cudaMemAttachHost flag is used.

I read the documentation on managed memory: for cudaMemAttachGlobal it says the memory is always accessible from both CPU and GPU, while for cudaMemAttachHost it says access is only conditionally possible. Honestly, this part of the docs could use a more detailed explanation. (CUDA Runtime API :: CUDA Toolkit Documentation)

So I thought it would work with AttachGlobal, but it didn't. Why?
To execute the above code with buf allocated as AttachGlobal, I have to insert cudaStreamSynchronize() after the kernel call. But AttachHost needs no synchronize() and works just fine.
I know that the CPU can access the buffer even if the buffer is being used by the GPU.
Am I wrong?

Of course, my actual code is more complex, so the problem may be caused by other parts. But I’m asking because I think this problem is caused by memory access.

I’ll wait for your reply.
thank you!

have you tried a cudaDeviceSynchronize() following the kernel call?

Quoting Robert Crovella, from another thread

Yes.
I’ve tried using cudaStreamSynchronize() or cudaDeviceSynchronize() after the kernel call and it works fine.

But adding a synchronize() after the kernel call is not what I want, because I want to read the next layer's parameters into the buffer while the kernel is still running on the GPU. As you can see from the code, it was never designed to have the CPU and GPU access the same memory address at the same time.
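For reference, the kind of overlap I'm after looks roughly like the pattern below. This is a rough, untested sketch assuming the stream-attachment approach; kernel, grid, block, and fp are placeholders from my real code. The idea is to allocate with cudaMemAttachHost, attach a buffer to a stream only while the GPU uses it, and alternate between two buffers so the CPU never touches memory a running kernel owns:

```cuda
const size_t MB = 1 << 20; // 1 MiB chunks, so chunks stay page-aligned
char *buf[2];
cudaStream_t s[2];
cudaStreamCreate(&s[0]);
cudaStreamCreate(&s[1]);
// cudaMemAttachHost: the CPU may access these while unrelated GPU work runs
cudaMallocManaged(&buf[0], MB, cudaMemAttachHost);
cudaMallocManaged(&buf[1], MB, cudaMemAttachHost);

for (int i = 0; i < 10; i++) {
  int b = i % 2;
  cudaStreamSynchronize(s[b]);   // the last kernel using buf[b] is done
  fread(buf[b], 1, MB, fp);      // CPU fills buf[b]; s[1-b] may still be running
  // Hand the buffer to this stream for the duration of the kernel ...
  cudaStreamAttachMemAsync(s[b], buf[b], 0, cudaMemAttachSingle);
  kernel<<<grid, block, 0, s[b]>>>(buf[b]);
  // ... and return it to the host once the kernel completes (stream-ordered)
  cudaStreamAttachMemAsync(s[b], buf[b], 0, cudaMemAttachHost);
}
cudaStreamSynchronize(s[0]);
cudaStreamSynchronize(s[1]);
```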

Also, the passage you quoted says Jetson managed memory cannot be accessed concurrently, so why is concurrent access possible when the cudaMemAttachHost flag is used? Shouldn't that be the case that is disallowed?

But it's going to write to exactly the same memory PAGE, unless your input buffer happens to be aligned so that each chunk falls into different pages. (I am assuming the 10MB you gave in your source code is 10 × 10^6 bytes, not 10 MiB = 10,485,760 bytes.)

I'm not sure what the exact page size is; some sources say it's 64 KiB for unified memory, others say 4 KiB.

What does the cudaMemAttachGlobal flag of cudaMallocManaged() mean, and how is it different from cudaMemAttachHost? Why can the above code run asynchronously, as I want, only with cudaMemAttachHost?

The documentation says.

If cudaMemAttachHost is specified, then the allocation should not be accessed from devices that have a zero value for the device attribute cudaDevAttrConcurrentManagedAccess

I am pretty sure that Jetson devices don't report a cudaDevAttrConcurrentManagedAccess value of 1.
This can be checked with such code:

#include <iostream>
#include <cuda_runtime.h>

int attr = 0;
cudaDeviceGetAttribute(&attr, cudaDevAttrConcurrentManagedAccess, 0); // device 0
std::cout << attr << std::endl; // 0 means no concurrent managed access

What I do not understand is why cudaMemAttachGlobal works and what its implications are for performance.