cudaMallocHost confusion

Hi all,

The more I read about cudaMallocHost the more confused I get.

After reading just the CUDA reference manual, I was under the impression that cudaMallocHost allocates memory that is directly accessible to both the device and the host.

After reading more here and there, people seem to use cudaMallocHost to accelerate host-device copies, which seems to imply that this memory is not directly accessible to the device.

(Of course, these are not mutually exclusive).

I have attached a very small piece of code. On my platform, it segfaults when using cudaMalloc, and gives the output below when using cudaMallocHost.

array: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
array: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Is this the expected behavior?

I am using cudaMallocHost to share a data structure between the host and the device. Since this data structure is quite complex (nested arrays of structs, so many pointers), it seemed a convenient way to avoid the “deep copy”. Was that a stupid decision?

Thanks,
test.cu (916 Bytes)

Hi,

Two things:

  • cudaMalloc allocates memory on the GPU => not directly accessible from CPU code
  • cudaMallocHost allocates memory in system RAM => not directly accessible “as-is” from the GPU with that pointer.

Then, if you want to access memory allocated with cudaMallocHost directly from the GPU, you will have to use cudaHostGetDevicePointer() to get a pointer that is valid in GPU code. The mechanism is called “zero-copy” (if you want to google that).
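
Something like this is the classic zero-copy pattern (my own sketch, not your test.cu; I am using cudaHostAlloc with the cudaHostAllocMapped flag, which is the portable way to get a mapped allocation, rather than plain cudaMallocHost):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;   // writes go straight into the mapped host buffer
}

int main()
{
    const int n = 16;

    // Must be called before the CUDA context is created,
    // otherwise mapped allocations are not available.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *h_ptr;
    cudaHostAlloc((void**)&h_ptr, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i)
        h_ptr[i] = (float)i;

    // Ask the runtime for the device-side alias of the mapped host buffer.
    float *d_ptr;
    cudaHostGetDevicePointer((void**)&d_ptr, h_ptr, 0);

    scale<<<1, n>>>(d_ptr, n);      // the kernel gets d_ptr, not h_ptr
    cudaDeviceSynchronize();        // make the kernel's writes visible

    for (int i = 0; i < n; ++i)
        printf("%g ", h_ptr[i]);
    printf("\n");

    cudaFreeHost(h_ptr);
    return 0;
}

The key point is that the kernel receives d_ptr, not h_ptr; without UVA those two addresses are in general different.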

The situation here changed very recently with CUDA 4.0, which has made things a little confusing.

Before CUDA 4.0:

Making host memory directly accessible on the device required two steps. First, you had to allocate the memory using cudaHostAlloc(), which has a superset of the capabilities of cudaMallocHost(). (Presumably, they did not want to change cudaMallocHost for backward compatibility reasons.) In particular, you needed the cudaHostAllocMapped flag, which allocated page-locked (“pinned”) host memory and also mapped that memory into the address space of your CUDA device. However, the pointer addresses were not portable between host and device, so to actually pass a pointer to this host-side block of memory to a kernel, you had to then call cudaHostGetDevicePointer() to find the device-side address. As you can imagine, this makes it basically impossible for host and device to operate on data structures that contain other pointers.
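
To make that last point concrete, here is a rough sketch of the pre-4.0 situation for a structure that contains an embedded pointer (the Table struct and all names are invented for illustration): every embedded pointer has to be overwritten with its device alias before a kernel can follow it, which is exactly what makes nested structures so painful.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical nested structure, just for illustration.
struct Table {
    int    n;
    float *values;   // pointer embedded in the structure
};

__global__ void sum(Table *t, float *out)
{
    float s = 0.0f;
    for (int i = 0; i < t->n; ++i)
        s += t->values[i];   // the device follows the embedded pointer
    *out = s;
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);

    Table *h_tab;  float *h_vals;  float *h_out;
    cudaHostAlloc((void**)&h_tab,  sizeof(Table),      cudaHostAllocMapped);
    cudaHostAlloc((void**)&h_vals, 16 * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void**)&h_out,  sizeof(float),      cudaHostAllocMapped);

    h_tab->n = 16;
    for (int i = 0; i < 16; ++i)
        h_vals[i] = 1.0f;

    // Before UVA the device-side addresses differ from the host-side ones,
    // so every embedded pointer must be rewritten with its device alias
    // before the structure is usable in a kernel.
    Table *d_tab;  float *d_vals;  float *d_out;
    cudaHostGetDevicePointer((void**)&d_tab,  h_tab,  0);
    cudaHostGetDevicePointer((void**)&d_vals, h_vals, 0);
    cudaHostGetDevicePointer((void**)&d_out,  h_out,  0);
    h_tab->values = d_vals;   // pre-UVA, this alias is not dereferenceable on the host

    sum<<<1, 1>>>(d_tab, d_out);
    cudaDeviceSynchronize();
    printf("sum = %g\n", *h_out);

    cudaFreeHost(h_tab);  cudaFreeHost(h_vals);  cudaFreeHost(h_out);
    return 0;
}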

After CUDA 4.0:

With the release of CUDA 4.0, there is another option available, but only if you are using a compute capability 2.x device and a 64-bit OS. (Furthermore, this new option doesn’t work with Windows Vista or 7 unless your CUDA device is running with the “TCC” driver that is only available for Tesla cards.) If you meet all those requirements, then the new Unified Virtual Addressing means that pointers are uniquely defined globally, so host and device see the same addresses.
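
If you want to check programmatically whether UVA is active on your setup, the unifiedAddressing field of cudaDeviceProp reports it. A small sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // unifiedAddressing is 1 when the host and this device share one address space
    // (CUDA 4.0+, compute capability 2.x, 64-bit OS, TCC driver on Windows).
    printf("UVA %s on %s\n",
           prop.unifiedAddressing ? "enabled" : "disabled", prop.name);
    return 0;
}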

Now, the release notes discuss the simplification of memory copies, but don’t say exactly how this impacts mapped memory. Hopefully someone can clarify.

Thanks for the answers so far.

Unfortunately, I am still confused.

My code sample suggests that (on my setup) the memory allocated with cudaMallocHost is directly accessible to both the host and the device without using cudaHostGetDevicePointer(), which is exactly my goal (because of the many pointers in my data structure, see seibert’s post).

I am using the latest CUDA version (I assume 4.0), Linux x86_64, and a Quadro 5000/PCI/SSE2 (although I am not compiling with “-arch=sm_20”, except for this example code which uses printf()).

Am I correct to assume that, because of what seibert said, my code sample is “officially supported” on my setup? Or is it in the realm of “undefined behavior” and I was/am just lucky?

That would be a Fermi card on 64-bit Linux, so yes, you’re using UVA, and therefore all pinned memory allocations are both portable and mapped into the device address space.
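
In other words, something along these lines (my own reconstruction, not your actual test.cu) is perfectly legal under UVA: the pointer returned by cudaMallocHost goes straight into the kernel launch, with no cudaHostGetDevicePointer() in sight. (Compile with -arch=sm_20 for the device-side printf.)

#include <cstdio>
#include <cuda_runtime.h>

__global__ void print_array(int *a, int n)
{
    // Same pointer value the host uses, thanks to UVA.
    printf("array:");
    for (int i = 0; i < n; ++i)
        printf(" %d", a[i]);
    printf("\n");
}

int main()
{
    const int n = 16;
    int *a;
    cudaMallocHost((void**)&a, n * sizeof(int));   // pinned host memory

    for (int i = 0; i < n; ++i)
        a[i] = i;

    printf("array:");
    for (int i = 0; i < n; ++i)
        printf(" %d", a[i]);
    printf("\n");

    // Under UVA the host pointer is passed to the kernel as-is.
    print_array<<<1, 1>>>(a, n);
    cudaDeviceSynchronize();   // flushes the device-side printf

    cudaFreeHost(a);
    return 0;
}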

Any chance of this working in the other direction? Transparent access to device memory contents on the host directly through the device pointers? I’m sure someone out there is dying to make a linked list that snakes across several CUDA devices and the host memory. :)

It seems that this also works in the other direction.

This can easily be verified by extending my code sample (modify the array on the device, cudaThreadSynchronize, print the array on the host).
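
Concretely, the extension looks something like this (a sketch from memory, not the exact file):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] += 100;           // the device writes into the pinned host allocation
}

int main()
{
    const int n = 16;
    int *a;
    cudaMallocHost((void**)&a, n * sizeof(int));
    for (int i = 0; i < n; ++i)
        a[i] = i;

    increment<<<1, n>>>(a, n);     // host pointer passed directly (UVA)
    cudaDeviceSynchronize();       // cudaThreadSynchronize() in the CUDA 4.0 API

    printf("array:");
    for (int i = 0; i < n; ++i)
        printf(" %d", a[i]);       // the host sees the device's writes
    printf("\n");

    cudaFreeHost(a);
    return 0;
}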

Thanks,