The CUDA 4.0 manual says that the pointer and the size passed to this API need to be aligned to the page size (4 KB).
But the usual malloc() does not allocate memory that satisfies this requirement.
Even if the size requested from malloc() is a multiple of the page size, the returned address is not necessarily aligned to the page size.
If I want to use this API, then do I have to allocate more memory than needed and manually align it to the page size?
Is there an easier way to do this, so that the cudaHostRegister() API is more convenient to use?
Or is there a special malloc() that allocates aligned memory?
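
To make the question concrete, here is a minimal sketch in C of the two options I can think of: over-allocating with malloc() and aligning by hand, and using posix_memalign(). The helper names round_up, alloc_aligned_manual and alloc_aligned_posix are made up for illustration, and I am assuming a POSIX system where posix_memalign() and sysconf(_SC_PAGESIZE) are available. None of this comes from the CUDA manual.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign() and sysconf() */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Round n up to the next multiple of align (align must be a power of two). */
static size_t round_up(size_t n, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Option 1 (hypothetical helper): over-allocate with malloc() and align by hand.
   The returned pointer is aligned and has round_up(size, align) usable bytes.
   *raw_out receives the original pointer, which is the one to pass to free(). */
static void *alloc_aligned_manual(size_t size, size_t align, void **raw_out) {
    void *raw = malloc(round_up(size, align) + align - 1);
    if (raw == NULL) {
        return NULL;
    }
    *raw_out = raw;
    return (void *)(((uintptr_t)raw + align - 1) & ~(uintptr_t)(align - 1));
}

/* Option 2 (hypothetical helper): let posix_memalign() do the alignment.
   The returned pointer can be passed to free() directly. */
static void *alloc_aligned_posix(size_t size, size_t align) {
    void *p = NULL;
    if (posix_memalign(&p, align, round_up(size, align)) != 0) {
        return NULL;
    }
    return p;
}

/* Intended use (not compiled here, since it needs the CUDA runtime):
     size_t page  = (size_t)sysconf(_SC_PAGESIZE);
     size_t bytes = round_up(wanted, page);
     void  *buf   = alloc_aligned_posix(wanted, page);
   buf and bytes would then be the pointer and size given to cudaHostRegister(). */
```

With option 1 the original pointer has to be kept around for free(), which is the bookkeeping I would like to avoid. Option 2 avoids that, but I do not know whether it is the recommended way to prepare memory for cudaHostRegister().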