The CUDA 4.0 manual says that the pointer and the size passed to this API need to be aligned to the page size (4 KB).
But the usual malloc() does not allocate a chunk of memory that satisfies this requirement.
Even if the size requested from malloc() is a multiple of the page size, the returned address is not aligned to a page boundary.
If I want to use this API, then do I have to allocate more memory than needed and manually align it to the page size?
Is there an easy way of doing this that would make the cudaHostRegister() API more useful?
Or is there a special malloc() that allocates aligned memory?
Thanks.
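One way to do the manual alignment mentioned above - a minimal sketch in plain C, assuming a power-of-two page size; the caller must keep the raw pointer around to pass to free() later:

```c
#include <stdint.h>
#include <stdlib.h>

/* Over-allocate by one page, then round the pointer up to the next
 * page boundary. The original malloc'ed pointer is returned through
 * raw_out because that is what must be passed to free(). */
void *alloc_page_aligned(size_t size, size_t page_size, void **raw_out)
{
    void *raw = malloc(size + page_size);
    if (raw == NULL)
        return NULL;
    uintptr_t aligned = ((uintptr_t)raw + page_size - 1)
                        & ~(uintptr_t)(page_size - 1);
    *raw_out = raw;              /* keep this for free() */
    return (void *)aligned;
}
```

The aligned pointer lies at most one page past the raw pointer, so the requested size always fits inside the over-allocated block.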
If mmap() pointers work, does that mean you can use the OS’s own file-to-memory mapping feature (mmap()) to let the GPU directly read and write files on disk?
It may be pretty inefficient (wow, what latency!), but it would still be cool even in theory.
It is true that pages are the basic units of the MMU, but what I mentioned could be implemented by page-locking all of the pages that overlap the given address region, even when the address is not aligned to the page size.
I’ve not yet had a look at this, but I plan to use this feature.
Is it not possible to give cudaHostRegister() a start pointer and a size that encompass the range returned by malloc (or any other allocation function) while being aligned to the page size?
It seems that it is actually possible to page-lock a 4 KB-aligned range encompassing a malloc’ed range.
However, it is then not possible to perform a cudaMemcpyAsync on the malloc’ed range; my bet is that the pointer given to cudaMemcpyAsync must also be 4 KB-aligned …
@neohpcer: if you can choose how you allocate the ranges you want to lock, you can use posix_memalign(), declared in stdlib.h (or C11’s aligned_alloc()).
However, in a real-world scenario where one doesn’t manage the memory allocation, is there a way to benefit from page-locking memory?
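For the case where you do control allocation, the standard POSIX call is posix_memalign(); a minimal sketch:

```c
#include <stdlib.h>
#include <unistd.h>

/* Allocate `size` bytes aligned to the system page size.
 * Returns NULL on failure; release the result with free(). */
void *page_aligned_alloc(size_t size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *buf = NULL;
    if (posix_memalign(&buf, page, size) != 0)
        return NULL;
    return buf;
}
```

Note that posix_memalign() only guarantees the alignment of the start address; the size still has to be rounded up to a page multiple separately before being handed to cudaHostRegister().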
Developers can work around the page alignment constraint by rounding the address from malloc() down to the next page boundary, and rounding the size up accordingly. Discovering the page size is platform-dependent. On Windows, call GetSystemInfo() and look at SYSTEM_INFO::dwPageSize. On Linux, call getpagesize().
The only side effect is that <4K on each end of the allocation will be made nonpageable - harmless.
This should work on all CUDA target platforms, which only enforce memory protections on page boundaries.
@Julien: there are no alignment constraints on the inputs to cudaMemcpyAsync(). Any host address range that has been pinned by CUDA is valid - page alignment is not required.
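The round-down/round-up arithmetic described above can be sketched in plain C (assuming the page size is a power of two, as on all the platforms mentioned):

```c
#include <stddef.h>
#include <stdint.h>

/* Round [ptr, ptr + size) outward to page boundaries: the base is
 * rounded down and the length rounded up, so the original range is
 * fully covered. The resulting (base, len) pair is what one would
 * pass to cudaHostRegister(base, len, 0); the original, unaligned
 * pointer can then still be used with cudaMemcpyAsync(). */
void page_extents(const void *ptr, size_t size, size_t page,
                  void **base_out, size_t *len_out)
{
    uintptr_t p    = (uintptr_t)ptr;
    uintptr_t base = p & ~(uintptr_t)(page - 1);                      /* round down */
    uintptr_t end  = (p + size + page - 1) & ~(uintptr_t)(page - 1);  /* round up   */
    *base_out = (void *)base;
    *len_out  = (size_t)(end - base);
}
```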
Yes, you can - if you control the I/O mapping in the filesystem or the driver. The KGPU project has an initial attempt at this; the current code lets you share pages in your vma with CUDA in page-locked form. See the code for details: KGPU-github
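For what it’s worth, the user-space side of that idea can be sketched in plain C: mmap() always returns a page-aligned address, so only the length needs rounding up. Whether the driver will actually accept a file-backed mapping in cudaHostRegister() is driver-dependent, so the call (shown only in a comment here) has to be checked for errors:

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only. mmap() guarantees a page-aligned
 * address; the length is rounded up to a page multiple, and bytes
 * past EOF in the last page read as zero. One could then attempt
 *     cudaHostRegister(p, *len_out, 0);
 * and fall back to a staging copy if the driver rejects it. */
void *map_file(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len  = ((size_t)st.st_size + page - 1) & ~(page - 1);
    void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return NULL;
    *len_out = len;
    return p;
}
```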
I have looked around a lot these past few days to find a solution to my problem which is related to cudaHostRegister():
I have an application that allocates memory in normal C/C++ code and passes it to my function/module, which needs to copy 100+ MB to a GTX 460 device. Using pageable memory was too slow for my scenario, so I looked into page-locked memory. Allocating the memory with cudaHostAlloc is not an option: an extra memory copy would double the required main memory, and I cannot rewrite the calling application to allocate with cudaHostAlloc. I found cudaHostRegister in the CUDA API and thought it would solve my problem, but the bandwidth between RAM and GPU only reaches 1.3 GB/s. If I manually change my module to use cudaHostAlloc and copy the data, I get 5.6 GB/s using the same data and timing methods.
Is the page-locked memory created by cudaHostRegister different from the page-locked memory allocated by cudaHostAlloc?
And if yes, how so? According to the CUDA Programming Guide, section 3.2.4, I am assuming that cudaHostAlloc and cudaHostRegister both create page-locked memory that can use the DMA transfer efficiently.
Here is a short code snippet of the things I am doing.
const unsigned int nrOfElements = 10000000;
int * data;
size_t sysPageSize = sysconf(_SC_PAGESIZE);
// make sure the size is a multiple of sysPageSize
size_t vectorSizeInBytes = sysPageSize * ((sizeof(int) * nrOfElements + (sysPageSize - 1)) / sysPageSize);
// allocate page-aligned memory
int memErr;
if ((memErr = posix_memalign((void **) &data, sysPageSize, vectorSizeInBytes))) {
    if (memErr == EINVAL)
        std::cout << "einval" << std::endl;
    else
        std::cout << "enomem" << std::endl;
    throw "could not allocate memory";
}
/* ... initialize data ... */
/* call the function passing the page-aligned memory pointer */
int * input_d_raw;
// register the whole page-aligned, page-sized buffer
cudaError_t err = cudaHostRegister((void *) data, vectorSizeInBytes, 0);
if (err != cudaSuccess)
    std::cout << "Unable to use pinned memory." << std::endl;
cudaMalloc((void **) &input_d_raw, sizeof(int) * nrOfElements);
cudaMemcpyAsync((void *) input_d_raw, (void *) data, nrOfElements * sizeof(int), cudaMemcpyHostToDevice, 0);
cudaDeviceSynchronize();
cudaHostUnregister((void *) data);
Thanks in advance for any hints that may help to solve my problem.