I am a Jetson AGX Xavier user, and as far as I know this device fully supports unified memory. Is that right?
I have a question about UM (unified memory), so let me explain the situation first.
For example, for object-detection inference, I allocate 10 MB of UM to hold the parameters, using the default flag (cudaMemAttachGlobal). If there are 10 layers and each layer uses 1 MB of parameters, 10 MB should be sufficient, right?
The code works roughly like this:
const size_t MB = 1 << 20;
cudaMallocManaged(&buf, 10 * MB, cudaMemAttachGlobal);
size_t offset = 0;
for (int i = 0; i < 10; i++) {
    layer *l = &network.layers[i];
    fread(buf + offset, 1, MB, fp);   // read 1 MB of parameters from disk
    l->buf_gpu = buf + offset;
    kernel<<< ... >>>(l->buf_gpu);    // the GPU uses the 1 MB starting at buf + offset
    offset += MB;                     // advance the offset by 1 MB
}
This code executes the kernel 10 times, advancing offset through buf.
What I do not understand is that the above code does not work when the cudaMemAttachGlobal flag is used, but it works fine when the cudaMemAttachHost flag is used.
I read the documentation on managed memory. For the AttachGlobal flag it says the memory is always accessible to both the CPU and the GPU, while for the AttachHost flag it says access is only conditional. Honestly, this part of the docs could use a more detailed explanation. (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gd228014f19cc0975ebe3e0dd2af6dd1b)
So I expected it to work with AttachGlobal, but it didn't. Why?
To run the above code with buf allocated as AttachGlobal, I have to insert cudaStreamSynchronize() after each kernel launch. With AttachHost, no synchronize() is needed and the code works just fine.
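For reference, here is a stripped-down repro of what I mean. The dummy kernel, grid size, and the CPU write standing in for fread() are my own simplifications, just for illustration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Dummy kernel standing in for one inference layer.
__global__ void kernel(float *p) { p[0] += 1.0f; }

int main() {
    const size_t MB = 1 << 20;
    float *buf;
    // Variant A: default flag (cudaMemAttachGlobal).
    cudaMallocManaged(&buf, 10 * MB, cudaMemAttachGlobal);
    size_t offset = 0;
    for (int i = 0; i < 10; i++) {
        float *p = buf + offset / sizeof(float);
        p[0] = 0.0f;              // CPU write, standing in for fread()
        kernel<<<1, 1>>>(p);
        cudaStreamSynchronize(0); // without this line, the CPU write in the
                                  // next iteration crashes on my Xavier
        offset += MB;
    }
    // Variant B: allocating with cudaMemAttachHost instead, the same loop
    // runs fine for me with no synchronize at all.
    cudaFree(buf);
    printf("done\n");
    return 0;
}
```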
My understanding was that the CPU can access the buffer even while the buffer is being used by the GPU.
Am I wrong?
Of course, my actual code is more complex, so the problem may come from somewhere else, but I am asking because I suspect it is caused by memory access.
I’ll wait for your reply.
Thank you!