cudaHostRegister time linear in buffer size?

dlevi · February 6, 2018, 3:09pm

I have several host memory blocks that need to be pinned. Pinning them takes a substantial amount of time (~12 milliseconds).

Pre-allocating and pinning a single large buffer did not noticeably change the pin time. Neither did using cudaMallocHost instead of malloc/cudaHostRegister.

It looks like cudaHostRegister time is approximately linear in the buffer size. Is this correct? Or am I missing something?
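For anyone who wants to check this on their own system, here is a minimal sketch of how the measurement could be made. It is not the code used for the numbers above. The helper names `time_host_register` and `ms_per_mib` and the chosen sizes (1 MiB to 256 MiB) are made up for illustration. It assumes a working CUDA device, and it touches each buffer before pinning so that first-touch page faults stay out of the timed region as far as possible.

```cuda
// Sketch: time cudaHostRegister for a range of buffer sizes.
// Helper names and buffer sizes are illustrative assumptions.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: normalize a pin time by the buffer size in MiB.
static double ms_per_mib(double ms, size_t bytes) {
    return ms / (static_cast<double>(bytes) / (1024.0 * 1024.0));
}

// Hypothetical helper: returns the pin time in ms, or a negative value on failure.
static double time_host_register(size_t bytes) {
    void* buf = std::malloc(bytes);
    if (!buf) return -1.0;
    // Write to every page before timing so the pages are already resident.
    std::memset(buf, 0xA5, bytes);

    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaHostRegister(buf, bytes, cudaHostRegisterDefault);
    auto t1 = std::chrono::steady_clock::now();

    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaHostRegister: %s\n", cudaGetErrorString(err));
        std::free(buf);
        return -1.0;
    }
    cudaHostUnregister(buf);
    std::free(buf);
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    cudaFree(0);  // force context creation so it is not charged to the first pin
    for (size_t mib = 1; mib <= 256; mib *= 4) {
        size_t bytes = mib << 20;
        double ms = time_host_register(bytes);
        if (ms < 0.0) return 1;
        std::printf("%4zu MiB: %8.3f ms (%.4f ms/MiB)\n",
                    mib, ms, ms_per_mib(ms, bytes));
    }
    return 0;
}
```

If the pin time is roughly linear in size, the ms/MiB column should stay about constant as the buffer grows. A single run per size is noisy, so repeating each measurement a few times would give a more trustworthy picture.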

Robert_Crovella · February 6, 2018, 3:25pm

I believe that is correct.

njuffa · February 6, 2018, 3:59pm

Some third-party data I have seen suggests that the time complexity could be worse than linear, at least for some (large) allocation sizes on some operating systems. These CUDA API calls are thin wrappers that map pretty much directly to operating system API calls, so CUDA is at the mercy of the OS as far as performance goes. If you use an open-source OS, you should be able to track down the details of the underlying OS mechanics.

dlevi · February 6, 2018, 4:23pm

Thanks.

This actually makes sense now that I think about it. The OS may want to rearrange pages to accommodate a pin. The bigger the pin, the more pages there are to rearrange, and the more likely those pages are to be scattered.

njuffa · February 6, 2018, 6:01pm

My conclusion was pretty much the same. I don’t know whether it’s true, though. At least it makes for a plausible working hypothesis.