I’m trying Zero-Copy with some kernels on an Orin. It’s working fine, and there’s no performance hit unless I try using an atomic function on a mapped-pinned host buffer. However, on a whim, I commented out the call to cudaSetDeviceFlags(cudaDeviceMapHost). My unit tests still work just fine. Is that call actually needed these days? (Currently using CUDA 11.4.4. I don’t know when we’ll upgrade to 12.x.)
I ask because I don’t relish the thought of trying to find the main() of all our unit tests and applications (of which we have around 75) just to add code to set that device flag in an appropriate location.
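For reference, a minimal sketch of the pattern being described (names and sizes are made up here): a mapped pinned host buffer used directly by a kernel. On 64-bit Linux with unified addressing, the host pointer itself is usable in device code, whether or not `cudaSetDeviceFlags(cudaDeviceMapHost)` was ever called.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *h_buf = nullptr;
    // cudaHostAllocMapped requests a zero-copy mapping into the device address space
    cudaHostAlloc(&h_buf, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; i++) h_buf[i] = 1.0f;
    // With unified addressing the host pointer can be passed to the kernel directly;
    // cudaHostGetDevicePointer would return the same address.
    scale<<<(n + 255) / 256, 256>>>(h_buf, n);
    cudaDeviceSynchronize();
    cudaFreeHost(h_buf);
    return 0;
}
```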
It’s not needed; the flag is automatically set on a 64-bit Linux OS. See here.
Unified addressing is automatically enabled in 64-bit processes.
All host memory allocated through all devices using cudaMallocHost() and cudaHostAlloc() is always directly accessible from all devices that support unified addressing. This is the case regardless of whether or not the flags cudaHostAllocPortable and cudaHostAllocMapped are specified.
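One way to see the quoted behavior in practice (a sketch, assuming device 0): on a system with unified addressing, the "device pointer" for a pinned allocation is the same address as the host pointer, even without `cudaHostAllocMapped`.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int uva = 0;
    cudaDeviceGetAttribute(&uva, cudaDevAttrUnifiedAddressing, 0);

    float *h = nullptr, *d = nullptr;
    cudaMallocHost(&h, 1024 * sizeof(float));       // no mapped flag requested
    cudaHostGetDevicePointer((void **)&d, h, 0);
    // with uva == 1, expect d == h: the same address works on host and device
    printf("uva=%d host=%p device=%p\n", uva, (void *)h, (void *)d);
    cudaFreeHost(h);
    return 0;
}
```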
What about cudaMallocManaged()? And when we use these functions, do we need to use cudaSetDevice(int) to set up the device on multi-GPU systems?
cudaMallocManaged is unrelated to zero-copy, pinned memory, or cudaDeviceMapHost. Having said that, cudaMallocManaged may “default” to using pinned memory in certain scenarios, however this is not directly managed by the user, and no configuration flags are relevant/needed/possible.
cudaMallocManaged in typical usage does not have a device association; instead the typical behavior is migration of pages to the device (or processor) that touches them. Additional tutorial information is in unit 6 of this online series.
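To illustrate the migration behavior described above, here is a minimal sketch (made-up kernel and sizes): a managed allocation is touched on the host, then by a kernel, with no explicit cudaMemcpy and no device association at allocation time.

```cuda
#include <cuda_runtime.h>

__global__ void inc(int *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1;
}

int main()
{
    const int n = 256;
    int *a = nullptr;
    cudaMallocManaged(&a, n * sizeof(int));
    for (int i = 0; i < n; i++) a[i] = i;   // pages resident on the host
    inc<<<1, n>>>(a, n);                    // on demand-paged systems, pages
                                            // fault/migrate to the GPU as touched
    cudaDeviceSynchronize();                // required before the host touches a[] again
    cudaFree(a);
    return 0;
}
```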
Thanks very much, and a few more questions:
- When we use cudaMallocHost and cudaHostAlloc, should we use cudaSetDevice?
- What is the size limit for page-locked memory or pinned memory? Is there a difference between these two kinds of memory?
- Will memory allocated by cudaMallocManaged be moved to another device as a whole allocation? And will a small allocation made with cudaMallocManaged hurt performance?
These allocate host memory, not associated with a particular device. cudaSetDevice should be irrelevant.
They are the same, in my usage/experience, at least as far as modern CUDA on a 64-bit OS goes. The limits are not specified. They are generally OS specific, and it’s reasonable to assume an “upper bound” is the available host memory, but actual experience may vary (e.g. a limit of half of host memory, or 2GB, or some other thing). I don’t have a formula or specification for the maximum possible. Furthermore, allocating all of host memory as pinned is generally not advisable; it interferes with smooth operation of the VMM in the host OS.
In a demand-paged environment, the page faults generate movement page-by-page. In a non-demand-paged environment (e.g. Windows) the managed allocations are generally moved en masse at the point of a kernel launch. You can also avoid page-by-page movement (i.e. page faults) by “prefetching” the data via cudaMemPrefetchAsync.
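A sketch of the prefetch approach (assuming device 0 runs the kernel; adjust to your topology): move the managed pages up front so the kernel doesn’t pay fault-driven migration costs.

```cuda
#include <cuda_runtime.h>

int main()
{
    int dev = 0;
    cudaSetDevice(dev);

    float *a = nullptr;
    size_t bytes = (1 << 20) * sizeof(float);
    cudaMallocManaged(&a, bytes);

    // move the pages to the GPU before launching work, on the null stream
    cudaMemPrefetchAsync(a, bytes, dev, 0);
    // ... launch kernels that read/write a ...

    // optionally prefetch back to the host before CPU access
    cudaMemPrefetchAsync(a, bytes, cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();
    cudaFree(a);
    return 0;
}
```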
The link I already provided for the online tutorial series covers this in more depth - unit 6 covers managed memory and unit 7 mentions pinned memory. The CUDA programming guide (again, already posted here) also has considerable documentation on managed memory. And there are approximately 10^6 forum questions on these topics, that you can find with a bit of searching. Finally, managed memory performance has its own blog article.