cudaMallocHost confusion

Hi all,

The more I read about cudaMallocHost the more confused I get.

After reading just the CUDA reference manual, I was under the impression that cudaMallocHost allocates memory that is directly accessible to both the device and the host.

After reading more here and there, people seem to use cudaMallocHost to accelerate host-device copies, which seems to imply that this memory is not directly accessible to the device.

(Of course, these are not mutually exclusive).

I have attached a very small piece of code. On my platform, it segfaults when using cudaMalloc, and gives the output below when using cudaMallocHost.

array: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
array: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Is this the expected behavior?

I am using cudaMallocHost to share a data structure between the host and the device. Since this data structure is quite complex (nested arrays of structs, so many pointers), it seemed a convenient way to avoid the “deep copy”. Was that a stupid decision?

Thanks,
test.cu (916 Bytes)

Hi,

Two things:

  • cudaMalloc allocates memory on the GPU => not directly accessible from CPU code
  • cudaMallocHost allocates memory in system RAM => not directly accessible “as-is” from the GPU with that pointer.

Then, if you want to access memory allocated with cudaMallocHost directly from the GPU, you will have to use cudaHostGetDevicePointer() to get a pointer that is valid in GPU code. The mechanism is called “zero-copy” (if you want to google that).
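
Something like this is the classic zero-copy pattern (my own sketch, not your test.cu; I am using cudaHostAlloc with the cudaHostAllocMapped flag, which is the portable way to get a mapped allocation, rather than plain cudaMallocHost):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;   // writes go straight into the mapped host buffer
}

int main()
{
    const int n = 16;

    // Must be called before the CUDA context is created,
    // otherwise mapped allocations are not available.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *h_ptr;
    cudaHostAlloc((void**)&h_ptr, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i)
        h_ptr[i] = (float)i;

    // Ask the runtime for the device-side alias of the mapped host buffer.
    float *d_ptr;
    cudaHostGetDevicePointer((void**)&d_ptr, h_ptr, 0);

    scale<<<1, n>>>(d_ptr, n);      // the kernel gets d_ptr, not h_ptr
    cudaDeviceSynchronize();        // make the kernel's writes visible

    for (int i = 0; i < n; ++i)
        printf("%g ", h_ptr[i]);
    printf("\n");

    cudaFreeHost(h_ptr);
    return 0;
}

The key point is that the kernel receives d_ptr, not h_ptr; without UVA those two addresses are in general different.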

The situation here changed very recently with CUDA 4.0, which has made things a little confusing.

Before CUDA 4.0:

Making host memory directly accessible on the device required two steps. First, you had to allocate the memory using cudaHostAlloc(), which has a superset of the capabilities of cudaMallocHost(). (Presumably, they did not want to change cudaMallocHost for backward compatibility reasons.) In particular, you needed the cudaHostAllocMapped flag, which allocated page-locked (“pinned”) host memory and also mapped that memory into the address space of your CUDA device. However, the pointer addresses were not portable between host and device, so to actually pass a pointer to this host-side block of memory to a kernel, you had to then call cudaHostGetDevicePointer() to find the device-side address. As you can imagine, this makes it basically impossible for host and device to operate on data structures that contain other pointers.
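
To make that last point concrete, here is a rough sketch of the pre-4.0 situation for a structure that contains an embedded pointer (the Table struct and all names are invented for illustration): every embedded pointer has to be overwritten with its device alias before a kernel can follow it, which is exactly what makes nested structures so painful.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical nested structure, just for illustration.
struct Table {
    int    n;
    float *values;   // pointer embedded in the structure
};

__global__ void sum(Table *t, float *out)
{
    float s = 0.0f;
    for (int i = 0; i < t->n; ++i)
        s += t->values[i];   // the device follows the embedded pointer
    *out = s;
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);

    Table *h_tab;  float *h_vals;  float *h_out;
    cudaHostAlloc((void**)&h_tab,  sizeof(Table),      cudaHostAllocMapped);
    cudaHostAlloc((void**)&h_vals, 16 * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void**)&h_out,  sizeof(float),      cudaHostAllocMapped);

    h_tab->n = 16;
    for (int i = 0; i < 16; ++i)
        h_vals[i] = 1.0f;

    // Before UVA the device-side addresses differ from the host-side ones,
    // so every embedded pointer must be rewritten with its device alias
    // before the structure is usable in a kernel.
    Table *d_tab;  float *d_vals;  float *d_out;
    cudaHostGetDevicePointer((void**)&d_tab,  h_tab,  0);
    cudaHostGetDevicePointer((void**)&d_vals, h_vals, 0);
    cudaHostGetDevicePointer((void**)&d_out,  h_out,  0);
    h_tab->values = d_vals;   // pre-UVA, this alias is not dereferenceable on the host

    sum<<<1, 1>>>(d_tab, d_out);
    cudaDeviceSynchronize();
    printf("sum = %g\n", *h_out);

    cudaFreeHost(h_tab);  cudaFreeHost(h_vals);  cudaFreeHost(h_out);
    return 0;
}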

After CUDA 4.0:

With the release of CUDA 4.0, there is another option available, but only if you are using a compute capability 2.x device and a 64-bit OS. (Furthermore, this new option doesn’t work with Windows Vista or 7 unless your CUDA device is running with the “TCC” driver that is only available for Tesla cards.) If you meet all those requirements, then the new Unified Virtual Addressing means that pointers are uniquely defined globally, so host and device see the same addresses.
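
If you want to check programmatically whether UVA is active on your setup, the unifiedAddressing field of cudaDeviceProp reports it. A small sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // unifiedAddressing is 1 when the host and this device share one address space
    // (CUDA 4.0+, compute capability 2.x, 64-bit OS, TCC driver on Windows).
    printf("UVA %s on %s\n",
           prop.unifiedAddressing ? "enabled" : "disabled", prop.name);
    return 0;
}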

Now, the release notes discuss the simplification of memory copies, but don’t say exactly how this impacts mapped memory. Hopefully someone can clarify.

Thanks for the answers so far.

Unfortunately, I am still confused.

My code sample suggests that (on my setup) the memory allocated with cudaMallocHost is directly accessible to both the host and the device without using cudaHostGetDevicePointer(), which is exactly my goal (because of the many pointers in my data structure, see seibert’s post).

I am using the latest CUDA version (I assume 4.0), Linux x86_64, and a Quadro 5000/PCI/SSE2 (although I am not compiling with “-arch=sm_20”, except for this example code which uses printf()).

Am I correct to assume that, because of what seibert said, my code sample is “officially supported” on my setup? Or is it in the realm of “undefined behavior” and I was/am just lucky?

That would be a Fermi card on 64-bit Linux, so yes, you’re using UVA, and therefore all pinned memory allocations are both portable and mapped into the device address space.
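
In other words, something along these lines (my own reconstruction, not your actual test.cu) is perfectly legal under UVA: the pointer returned by cudaMallocHost goes straight into the kernel launch, with no cudaHostGetDevicePointer() in sight. (Compile with -arch=sm_20 for the device-side printf.)

#include <cstdio>
#include <cuda_runtime.h>

__global__ void print_array(int *a, int n)
{
    // Same pointer value the host uses, thanks to UVA.
    printf("array:");
    for (int i = 0; i < n; ++i)
        printf(" %d", a[i]);
    printf("\n");
}

int main()
{
    const int n = 16;
    int *a;
    cudaMallocHost((void**)&a, n * sizeof(int));   // pinned host memory

    for (int i = 0; i < n; ++i)
        a[i] = i;

    printf("array:");
    for (int i = 0; i < n; ++i)
        printf(" %d", a[i]);
    printf("\n");

    // Under UVA the host pointer is passed to the kernel as-is.
    print_array<<<1, 1>>>(a, n);
    cudaDeviceSynchronize();   // flushes the device-side printf

    cudaFreeHost(a);
    return 0;
}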

Any chance of this working in the other direction? Transparent access to device memory contents on the host directly through the device pointers? I’m sure someone out there is dying to make a linked list that snakes across several CUDA devices and the host memory. :)

It seems that this also works in the other direction.

This can easily be verified by extending my code sample (modify the array on the device, cudaThreadSynchronize, print the array on the host).
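
Concretely, the extension looks something like this (a sketch from memory, not the exact file):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] += 100;           // the device writes into the pinned host allocation
}

int main()
{
    const int n = 16;
    int *a;
    cudaMallocHost((void**)&a, n * sizeof(int));
    for (int i = 0; i < n; ++i)
        a[i] = i;

    increment<<<1, n>>>(a, n);     // host pointer passed directly (UVA)
    cudaDeviceSynchronize();       // cudaThreadSynchronize() in the CUDA 4.0 API

    printf("array:");
    for (int i = 0; i < n; ++i)
        printf(" %d", a[i]);       // the host sees the device's writes
    printf("\n");

    cudaFreeHost(a);
    return 0;
}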

Thanks,