I’m trying Zero-Copy with some kernels on an Orin. It’s working fine, and there’s no performance hit unless I try using an atomic function on a mapped-pinned host buffer. However, on a whim, I commented out the call to cudaSetDeviceFlags(cudaDeviceMapHost). My unit tests still work just fine. Is that call actually needed these days? (Currently using CUDA 11.4.4. I don’t know when we’ll upgrade to 12.x.)
I ask because I don’t relish the thought of trying to find the main() of all our unit tests and applications (of which we have around 75) just to add code to set that device flag in an appropriate location.
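For reference, a minimal sketch of the pattern being described (names and sizes are made up here): a mapped pinned host buffer used directly by a kernel. On 64-bit Linux with unified addressing, the host pointer itself is usable in device code, whether or not `cudaSetDeviceFlags(cudaDeviceMapHost)` was ever called.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *h_buf = nullptr;
    // cudaHostAllocMapped requests a zero-copy mapping into the device address space
    cudaHostAlloc(&h_buf, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; i++) h_buf[i] = 1.0f;
    // With unified addressing the host pointer can be passed to the kernel directly;
    // cudaHostGetDevicePointer would return the same address.
    scale<<<(n + 255) / 256, 256>>>(h_buf, n);
    cudaDeviceSynchronize();
    cudaFreeHost(h_buf);
    return 0;
}
```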
It’s not needed; the flag is automatically set on a 64-bit Linux OS. See here.
Unified addressing is automatically enabled in 64-bit processes.
All host memory allocated through all devices using cudaMallocHost() and cudaHostAlloc() is always directly accessible from all devices that support unified addressing. This is the case regardless of whether or not the flags cudaHostAllocPortable and cudaHostAllocMapped are specified.
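One way to see the quoted behavior in practice (a sketch, assuming device 0): on a system with unified addressing, the "device pointer" for a pinned allocation is the same address as the host pointer, even without `cudaHostAllocMapped`.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int uva = 0;
    cudaDeviceGetAttribute(&uva, cudaDevAttrUnifiedAddressing, 0);

    float *h = nullptr, *d = nullptr;
    cudaMallocHost(&h, 1024 * sizeof(float));       // no mapped flag requested
    cudaHostGetDevicePointer((void **)&d, h, 0);
    // with uva == 1, expect d == h: the same address works on host and device
    printf("uva=%d host=%p device=%p\n", uva, (void *)h, (void *)d);
    cudaFreeHost(h);
    return 0;
}
```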
What about cudaMallocManaged()? And when we use these functions, do we need to use cudaSetDevice(int) to set up the device on multi-GPU systems?
cudaMallocManaged is unrelated to zero-copy, pinned memory, or cudaDeviceMapHost. Having said that, cudaMallocManaged may “default” to using pinned memory in certain scenarios, however this is not directly managed by the user, and no configuration flags are relevant/needed/possible.
cudaMallocManaged in typical usage does not have a device association; instead the typical behavior is migration of pages to the device (or processor) that touches them. Additional tutorial information is in unit 6 of this online series.
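To illustrate the migration behavior described above, here is a minimal sketch (made-up kernel and sizes): a managed allocation is touched on the host, then by a kernel, with no explicit cudaMemcpy and no device association at allocation time.

```cuda
#include <cuda_runtime.h>

__global__ void inc(int *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1;
}

int main()
{
    const int n = 256;
    int *a = nullptr;
    cudaMallocManaged(&a, n * sizeof(int));
    for (int i = 0; i < n; i++) a[i] = i;   // pages resident on the host
    inc<<<1, n>>>(a, n);                    // on demand-paged systems, pages
                                            // fault/migrate to the GPU as touched
    cudaDeviceSynchronize();                // required before the host touches a[] again
    cudaFree(a);
    return 0;
}
```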
Thanks very much, and a few more questions:
- When we use cudaMallocHost and cudaHostAlloc, should we use cudaSetDevice?
- What is the size limit for page-locked memory or pinned memory? Is there a difference between these two kinds of memory?
- Will memory allocated by cudaMallocManaged be moved to another device as a whole allocation? And will a small allocation made with cudaMallocManaged hurt performance?
These allocate host memory, not associated with a particular device. cudaSetDevice should be irrelevant.
They are the same, in my usage/experience, at least as far as modern CUDA on a 64-bit OS goes. The limits are not specified. They are generally OS specific, and it’s reasonable to assume an “upper bound” is the available host memory, but actual experience may vary (e.g. a limit of half of host memory, or 2GB, or some other thing). I don’t have a formula or specification for the maximum possible. Furthermore, allocating all of host memory as pinned is generally not advisable; it interferes with smooth operation of the VMM in the host OS.
In a demand-paged environment, the page faults generate movement page-by-page. In a non-demand-paged environment (e.g. Windows) the managed allocations are generally moved en masse at the point of a kernel launch. You can also avoid page-by-page movement (i.e. page faults) by “prefetching” the data via cudaMemPrefetchAsync.
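A sketch of the prefetch approach (assuming device 0 runs the kernel; adjust to your topology): move the managed pages up front so the kernel doesn’t pay fault-driven migration costs.

```cuda
#include <cuda_runtime.h>

int main()
{
    int dev = 0;
    cudaSetDevice(dev);

    float *a = nullptr;
    size_t bytes = (1 << 20) * sizeof(float);
    cudaMallocManaged(&a, bytes);

    // move the pages to the GPU before launching work, on the null stream
    cudaMemPrefetchAsync(a, bytes, dev, 0);
    // ... launch kernels that read/write a ...

    // optionally prefetch back to the host before CPU access
    cudaMemPrefetchAsync(a, bytes, cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();
    cudaFree(a);
    return 0;
}
```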
The link I already provided for the online tutorial series covers this in more depth - unit 6 covers managed memory and unit 7 mentions pinned memory. The CUDA programming guide (again, already posted here) also has considerable documentation on managed memory. And there are approximately 10^6 forum questions on these topics, that you can find with a bit of searching. Finally, managed memory performance has its own blog article.