Using cudaHostRegister() in CUDA 4.0 CUDA 4.0


CUDA 4.0 manual says that the pointer and the size passed to this API needs to be aligned with the page size(4KB).
But usual malloc() does not allocate a chunk of memory that satisfies this requirement.
Even if the requested size to malloc() is multiples of page size, the returned address is not aligned to the page size.

If I want to use this API, then do I have to allocate more memory than needed and manually align it to the page size?
Is there an easy way of doing it to make cudaHostRegister() API more useful?
Or is there a special malloc() that allocates aligned memory?

mmap or valloc in Linux, VirtualAlloc in Windows

If mmap() pointers work, does it mean you can use it to take advantage of the OS’s own file-to-memory mapping feature (in mmap()) to allow the GPU to directly read and write files on disk?

It may be pretty inefficient (wow, what latency!), but it would still be cool even in theory.

As soon as you call cudaHostRegister() the pages will be read and locked in memory, I guess. But yes, it’s cooler than using fread().

SPWorley: no, you can’t use cudaHostRegister with IO-mapped memory.

Thanks for the replies.

tmurray: Is there a plan for releasing the new version of cudaHostRegister() that does not have this constraint (page size alignment) in the future?

I’d believe this to be impossible since pages are the basic unit in which memory is managed by the MMU.

It is true that pages are the basic units of MMU,
but what I mentioned could be implemented by page-locking all the pages related to the given address region
even when the address is not aligned to the page size.


I’ve not yet had a look at this, but I plan to use this feature.

Is it not possible to give cudaHostRegister() a start pointer and a size that would encompass the range return by malloc (or any other allocation function) while being aligned to the page size ?

Best regards,



self replying on this topic :

it seems that it is actually possible to pagelock a 4KB-aligned range encompassing a malloc’ed range.

However, it is then not possible to perform a cudaMemcpyAsync on the malloc’ed range, my bet is that the pointer given to cudaMemcpyAsync must also be 4KB aligned …

@neohpcer: if you can choose the way you allocate ranges that you want to lock, you can use aligned_malloc, aligned_free etc … defined in stdlib.h

However, in a real-case scenario when one doesn’t managed memory, is there a way to take benefit of page-locking memory ?



Developers can work around the page alignment constraint by rounding the address from malloc() down to the next page boundary, and rounding the size up accordingly. Discovering the page size is platform-dependent. On Windows, call GetSystemInfo() and look at SYSTEM_INFO::dwPageSize. On Linux, call getpagesize().

The only side effect is that <4K on each end of the allocation will be made nonpageable - harmless.

This should work on all CUDA target platforms, which only enforce memory protections on page boundaries.

@Julien: there are no alignment constraints on the inputs to cudaMemcpyAsync(). Any host address ranges have been pinned by CUDA are valid - page alignment is not required.

Yes, you can - if you control the IO-mapping in the fs or the driver. The KGPU project has an initial try of this, current code can allow you to share pages in your vma with CUDA in page-locked format. See more inside the code: KGPU-github



I have looked around a lot these past few days to find a solution to my problem which is related to cudaHostRegister():

I have an application that allocates memory in normal c/c++ code which is passed to my function/module that needs to copy 100+ MB to a GTX 460 gpu device. Using paged memory was too slow for my scenario, so I looked into page-locked memory. Allocating the memory using cudaHostAlloc is no option as a memory copy would double the needed main memory and I cannot rewrite the application above to allocate the memory with cudaHostAlloc. I found cudaHostRegister in the CUDA API and thought this would solve my problem but the bandwidth between RAM and GPU only gets to 1.3 GB/s. If I manually change my module to use cudaHostAlloc and copy the data I get 5.6 GB/s using the same data and timing methods.

Is the page-locked memory created by cudaHostRegister different from the page-locked memory allocated by cudaHostAlloc?

And if yes, how so? According to the CUDA Programming Guide, section 3.2.4, I am assuming that cudaHostAlloc and cudaHostRegister both create page-locked memory that can use the DMA transfer efficiently.

Here is a short code snippet of the things I am doing.

const unsigned int nrOfElements = 10000000;

int * data;

size_t sysPageSize = sysconf(_SC_PAGESIZE);

// make sure size is a multiple of sysPageSize

unsigned int vectorSizeInBytes = sysPageSize * ((sizeof(int) * elements + (sysPageSize - 1)) / sysPageSize);

// allocate page-aligned memory

int memErr;

if ((memErr = posix_memalign((void **)(&data), sysPageSize, vectorSizeInBytes))) {

    if(memErr == EINVAL)

        std::cout << "einval" << std::endl;


        std::cout << "enomem" << std::endl;

    throw "could not allocate memory";


/* ... initialize data ... */

/* call the function passing the page-aligned memory pointer */

int * input_d_raw;

size_t numbytes = ((int)((elements * sizeof(int) + 4095)/4096)) * 4096;

cudaError_t err = cudaHostRegister((void *) data, numbytes, 0);

if(err != cudaSuccess)

    std::cout << "Unable to use pinned memory." << std::endl;

cudaMalloc((void **) &input_d_raw, sizeof(int) * elements);

cudaMemcpyAsync((void *) input_d_raw, (void*) data, elements * sizeof(int), cudaMemcpyHostToDevice, 0);


cudaHostUnregister((void *) data);

Thanks in advance for any hints that may help to solve my problem.

i think the cuda sdk 4.0’s sample code can help you

the source directory simpleZeroCopy

see the source code

about this

#define MEMORY_ALIGNMENT  4096

#define ALIGN_UP(x,size) ( ((size_t)x+(size-1))&(~(size-1)) )
a_UA = (float *) malloc( bytes + MEMORY_ALIGNMENT );

    b_UA = (float *) malloc( bytes + MEMORY_ALIGNMENT );

    c_UA = (float *) malloc( bytes + MEMORY_ALIGNMENT );

// We need to ensure memory is aligned to 4K (so we will need to padd memory accordingly)

    a = (float *) ALIGN_UP( a_UA, MEMORY_ALIGNMENT );

    b = (float *) ALIGN_UP( b_UA, MEMORY_ALIGNMENT );

    c = (float *) ALIGN_UP( c_UA, MEMORY_ALIGNMENT );

cutilSafeCall(cudaHostRegister(a, bytes, CU_MEMHOSTALLOC_DEVICEMAP));

    cutilSafeCall(cudaHostRegister(b, bytes, CU_MEMHOSTALLOC_DEVICEMAP));

    cutilSafeCall(cudaHostRegister(c, bytes, CU_MEMHOSTALLOC_DEVICEMAP));

if the a_UA’s address is 0x00000001

then the a’s address is 0x00001000

the MARCO ALIGN_UP work like this

(0x00000001 + 0x00000fff) & (0xfffff000) = 0x00001000

0x00000001 is the worst case for align 4k

you can try it for any address

it’s useful english is very poor…i hope you can understand what i say…

FYI in CUDA 4.1 the restrictions on alignment and size are gone. You can host register a generic pointer.

Even if we use that generic pointer on HtoDAsync and DtoHAsync commands? Does it give error in any platform with this?