Consequences of not page-aligning buffers for cudaHostRegister()?

I’m new to CUDA and am currently experimenting with streams, which requires cudaMemcpyAsync(), which in turn requires the host pointers to be pinned. The majority of these are vector<> types though, rather than regular buffers created using malloc(), so I can see it becoming a bit messy to resize them based on the page-aligned size. Some are vector but others use more complex structs.

What is likely to happen if we don’t resize them to the page size, and pass their actual size to cudaHostRegister(). Will we see crashes or is it more likely to “just” affect performance?

The buffers in question are very small (a few tens of Kb at most), and are typically used to copy “lookup tables” to the GPU, and to copy the processed/reduced results back to the host. The much larger “source” buffers being sent to the GPU for processing are page-aligned.

It’s running on Windows by the way, Quadro RTX 4000, if that makes a difference.

I’ve never heard of cudaHostRegister() requiring specific allocation sizes to work correctly. Nor does it require (that I know of) a specific alignment to work correctly.

Thanks. So what are the advantages of page aligning - does it improve throughput when copying? Are there any risks in not doing this (i.e. crashes)?

I’ve not run into any discussions or documentation that says anything about page aligning. Have you? How do you page-align an allocation? I guess you are referring to something like posix_memalign()? I’ve never run into any recommendations about using that as a precursor to cudaHostRegister()

Earl(ier) CUDA versions had such a requirement documented:

→ The pointer ptr and size size must be aligned to the host page size (4 KB).

See also: Using cudaHostRegister() in CUDA 4.0 CUDA 4.0 - #16 by mfatica

→ FYI in CUDA 4.1 the restrictions on alignment and size are gone. You can host register a generic pointer.

See also cuda-samples/Samples/0_Introduction/simpleZeroCopy/ at master · NVIDIA/cuda-samples · GitHub line 165:

// We need to ensure memory is aligned to 4K (so we will need to padd memory
// accordingly)

I think the worst, what will still happen, is that without aligning on a page-boundary a larger amount of memory than necessary is page-locked. (Up to one page more per call). So it is no problem, if you do not have too many such memory areas.

More importantly, when using vector<> and other dynamic data structures is that you do not have to register memory too often, as that would take more time than not using regular cudaMemcpy instead of cudaMemcpyAsync(). So reserve a size large enough, if you plan to change their contents.

1 Like

I suspect that’s what it is - I’m primarily using examples from a book called Cuda By Example, but I’ve just noticed it was published in 2011.