I have several host memory blocks that need to be pinned. Pinning them was taking a substantial amount of time (about 12 ms).
Pre-allocating and pinning a single large buffer did not noticeably change the pin time. Neither did using cudaMallocHost instead of malloc/cudaHostRegister.
It looks like cudaHostRegister time is approximately linear in the buffer size. Is this correct? Or am I missing something?