Difference between cudaMallocManaged and malloc/new

Memory allocated with cudaMallocManaged or with malloc/new both work when accessed from the GPU, but cudaMallocManaged gives much better performance. The memory management behind cudaMallocManaged is well documented. My question is: what is going on when the GPU accesses memory allocated via malloc/new, i.e. is the data transferred as individual bytes (as needed) or also as pages? Any references to where this is documented would be great!
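For concreteness, here is a minimal sketch of the two allocation paths being compared; the buffer size and the kernel are arbitrary, and the malloc/new path only works on systems where the GPU can access pageable host memory (see the replies below).

```
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void inc(int *data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const size_t n = 1 << 20;

    // Path 1: managed memory, migrated between host and device by the driver.
    int *managed = nullptr;
    cudaMallocManaged(&managed, n * sizeof(int));
    inc<<<(n + 255) / 256, 256>>>(managed, n);
    cudaDeviceSynchronize();
    cudaFree(managed);

    // Path 2: plain system allocation, accessed directly by the GPU.
    // This requires a system where the GPU can access pageable memory
    // (e.g. HMM on x86 or ATS on Grace).
    int *sys = static_cast<int*>(malloc(n * sizeof(int)));
    inc<<<(n + 255) / 256, 256>>>(sys, n);
    cudaDeviceSynchronize();
    free(sys);
    return 0;
}
```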

(Host code) malloc/new for device-code access only works on devices/systems where that feature is available. Currently, that means x86 systems with HMM, or Grace with the ATS feature. Here are a couple of recent threads:

1 2 3

which also link to more info.
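As a quick way to check whether a given system supports this, the CUDA runtime exposes device attributes for pageable-memory access. Below is a small sketch using cudaDevAttrPageableMemoryAccess and cudaDevAttrPageableMemoryAccessUsesHostPageTables; my reading of what the values mean is in the comments.

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int dev = 0;
    cudaGetDevice(&dev);

    int pageable = 0, hostPageTables = 0;
    cudaDeviceGetAttribute(&pageable,
        cudaDevAttrPageableMemoryAccess, dev);
    cudaDeviceGetAttribute(&hostPageTables,
        cudaDevAttrPageableMemoryAccessUsesHostPageTables, dev);

    // pageable == 1       -> the GPU can dereference malloc/new pointers (HMM or ATS).
    // hostPageTables == 1 -> those accesses go through the host page tables (ATS, e.g. Grace).
    printf("pageable memory access: %d, uses host page tables: %d\n",
           pageable, hostPageTables);
    return 0;
}
```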

Hi,

My understanding of Grace with the ATS feature is that as GPU threads keep accessing cache lines, those accesses are initially served from CPU main memory (LPDDR). Roughly, the accesses proceed as follows:

As a GPU thread keeps accessing cache lines, the first accesses are served through ATS from host memory. After some number of accesses, data migration happens and the GPU thread reads the remaining data from the GPU's own DRAM (HBM). The exact mechanism, i.e. how many accesses are needed to trigger migration and how much data (how many pages) is migrated, is unclear to me.
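One way to probe this empirically (a rough sketch, not a definitive measurement) is to launch the same kernel repeatedly over a malloc'd buffer and time each launch; if access-counter-driven migration kicks in after enough remote accesses, later launches should get noticeably faster. The buffer size and iteration count below are arbitrary.

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void touch(float *x, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const size_t n = 1 << 26;                    // ~256 MiB of floats
    float *x = static_cast<float*>(malloc(n * sizeof(float)));
    for (size_t i = 0; i < n; ++i) x[i] = 0.0f;  // pages start resident in host memory

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time each launch; a drop in time after some iterations would suggest
    // the data has migrated to GPU DRAM.
    for (int iter = 0; iter < 10; ++iter) {
        cudaEventRecord(start);
        touch<<<(n + 255) / 256, 256>>>(x, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("iteration %d: %.3f ms\n", iter, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    free(x);
    return 0;
}
```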

cudaMallocManaged allocates memory on the host side and migrates it to GPU memory when a page fault occurs. The reason cudaMallocManaged gives better performance is that it only involves data migration plus accesses to device memory, whereas accessing malloc/new-allocated (host-side) memory involves accesses to host memory (more than 4x higher latency) plus data migration plus accesses to device memory.
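For illustration, here is a sketch of how the migration cost with cudaMallocManaged can be triggered up front with cudaMemPrefetchAsync, so the kernel reads from device DRAM instead of faulting page by page; the sizes and kernel are only examples.

```
#include <cuda_runtime.h>

__global__ void scale(float *x, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const size_t n = 1 << 24;
    float *x = nullptr;
    cudaMallocManaged(&x, n * sizeof(float));

    // First touch on the host: pages are resident in host memory.
    for (size_t i = 0; i < n; ++i) x[i] = 1.0f;

    // Migrate the pages to the GPU before launching, instead of relying
    // on demand page faults during the kernel.
    int dev = 0;
    cudaGetDevice(&dev);
    cudaMemPrefetchAsync(x, n * sizeof(float), dev, 0);

    scale<<<(n + 255) / 256, 256>>>(x, n);
    cudaDeviceSynchronize();

    cudaFree(x);
    return 0;
}
```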

I hope this answers your question.
Best.