cudaHostRegister and Fortran

In the code I currently work with, many of the host arrays that are copied to my device kernels are allocated far “above” the actual computation routines. As you traverse the call tree down from, for example, physics to moist processes, by the time you reach the kernel the temperature array is actually a pointer into space allocated far above.

This setup has often stymied my ability to try async/double buffering, use faster memory transfers, &c. since I’d need pinned memory and it’s a bit difficult to pinpoint, exactly, who first allocated that array to pin it there. (I could, of course, allocate new pinned memory buffers inside my local routines and work with them, but that would be doubling the host memory needed for quite a few, very large arrays. Plus, allocating pinned memory is slow, so doing it every timestep is not good.)

So, I was intrigued when I learned of cudaHostRegister in CUDA 4.0. It seems like it would help in that I could register, say, my local temperature pointer and then gain the chance for faster transfers, the possibility of async. copies, et al.

But that leads to my question: reading the CUDA Fortran User’s Guide, am I right in thinking that I can’t HostRegister a Fortran array/pointer? Rather I’d need to use iso_c_binding and have fun with posix_memalign, c_ptr, c_f_pointer, &c.? (In which case, I’d be valloc’ing a new array and doubling space again…)
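For reference, here is a sketch of the route I mean, assuming a hand-declared interface to posix_memalign (the interface, the size `n`, and the `temp` pointer are all my own placeholders, not anything from the CUDA Fortran guide):

```fortran
use iso_c_binding
implicit none

! Hand-declared interface to the C library routine (a sketch; there is
! no standard Fortran module that provides this).
interface
  integer(c_int) function posix_memalign(memptr, alignment, size) &
      bind(c, name="posix_memalign")
    use iso_c_binding
    type(c_ptr), intent(out) :: memptr
    integer(c_size_t), value :: alignment, size
  end function
end interface

integer, parameter :: n = 1024      ! placeholder element count
type(c_ptr)       :: cp
real(8), pointer  :: temp(:)        ! Fortran view of the C buffer
integer(c_int)    :: ierr

! Allocate n doubles starting on a 4K boundary...
ierr = posix_memalign(cp, 4096_c_size_t, int(n*8, c_size_t))
! ...and wrap the raw C pointer as a Fortran array.
call c_f_pointer(cp, temp, [n])
```

Which is exactly the “new array, doubled space” problem: `temp` is a fresh buffer, not the array my callers already filled in.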

Thanks for any help with this and other questions sure to surface as I explore all the new 4.0 routines.


The cudaHostRegister implementation in CUDA 4.0 is a little difficult for us to work with in CUDA Fortran. Currently, it requires the buffer you would like to pin to be aligned on a 4K boundary, and of a size that is a multiple of 4K.

In general, the address of your buffer is not that accessible in Fortran, so this is a difficult concept for the language.

We’ve heard of some plans from NVIDIA to alleviate some of these issues, probably in CUDA 4.1, but no promises there.

PGI is working on an extension to CUDA Fortran, an “ALIGN” qualifier on the allocate statement, that will, as you point out, sit atop the underlying posix_memalign() or other platform-specific routine to return an aligned buffer area.

Again, no promises when that will be ready, but hopefully in the next few releases. If you allocate an area aligned on a 4K boundary, then the size can be padded up to a multiple of 4K, making it acceptable to the current cudaHostRegister implementation. Then it is just up to you as a programmer to properly allocate the arrays that you may possibly want pinned later on.
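As a sketch of the padding arithmetic (my own example, not the planned ALIGN extension), assuming the buffer itself was allocated with at least the padded size starting on a 4K boundary:

```fortran
integer, parameter :: PAGE = 4096
integer :: nbytes, padded

nbytes = n * 8                                ! n double-precision elements
padded = ((nbytes + PAGE - 1) / PAGE) * PAGE  ! round up to a 4K multiple
istat  = cudaHostRegister(c_loc(a), padded, 0)
```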

Hope that helps.

  • Brent

Has the status of cudaHostRegister changed? Can I page-lock a chunk of host memory without iso_c_binding?

Even with iso_c_binding, I’m having no luck:

use cudafor
use iso_c_binding
double precision, target :: junk(1024)
double precision, device, allocatable :: d_junk(:)
integer :: istat

allocate(d_junk(1024))
istat = cudaHostRegister(c_loc(junk), 1024, 0)
istat = cudaMemcpyAsync(d_junk, junk, 1024)

gives a runtime failure

0: copyin Memcpy (dev=0x500200000, host=0x68f040, size=8192) FAILED: 11(invalid argument)

Without the call to cudaHostRegister, all is well. Can anyone see why?

Two possible issues here:

The current implementation of cudaHostRegister has not been overloaded to take all datatypes like many of the other API routines, so you need to pass it c_loc() of the array like you did. But, then, the count is in bytes, so you need to pass 1024*8.

That should probably do the trick. Then, if the async memcpy is actually happening asynchronously, you might be running into trouble deallocating the array before the transfer happens! (Maybe the CUDA runtime synchronizes those two operations, I’m not sure…)

If you use the cudafor module, it includes iso_c_binding, so you don’t need to use that explicitly.
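Putting those points together, the corrected sequence would look something like this (a sketch: the byte count and the c_loc() usage come from the advice above; the explicit synchronize and the cudaHostUnregister call are my own additions to guard the async copy):

```fortran
use cudafor   ! includes iso_c_binding
double precision, target :: junk(1024)
double precision, device, allocatable :: d_junk(:)
integer :: istat

allocate(d_junk(1024))

! Size is in bytes: 1024 elements * 8 bytes each.
istat = cudaHostRegister(c_loc(junk), 1024*8, 0)

! The element count is fine here; the overloaded CUDA Fortran
! interface knows the datatype.
istat = cudaMemcpyAsync(d_junk, junk, 1024)

! Make sure the async transfer has finished before junk is reused
! or goes out of scope, then unpin the buffer.
istat = cudaDeviceSynchronize()
istat = cudaHostUnregister(c_loc(junk))
```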

That did it. Thanks.