Why can cudaHostRegister show a huge difference in performance?

I’m using cudaHostRegister for faster CPU-GPU transfers, but I’m seeing a huge difference in performance:
The first block of memory comes from a Rust vector (already initialized). Its size is 8 GB, and pinning it took about 259 ms.
The second block was allocated directly with the Rust allocator (the same one the Rust vector uses) and left uninitialized. Its size is also 8 GB, but pinning it took 1678 ms.
What could cause this huge gap, and how can I narrow it?

Perhaps the second block was not aligned to a page boundary? Or perhaps the first was initialized with zeros only.
I assume you tried switching the order? How much memory does your system have in total?

What happens if it’s not on a page boundary? In any case, I doubt alignment is the cause, since neither approach guarantees that the address is page-aligned.
The first block is initialized with random numbers, and switching the order has no effect. I’m not sure exactly how much memory the system has, but I am sure it is more than 256 GB.
My mentor suggests that the OS may not have assigned physical pages to the uninitialized memory. I will test that idea later.

Try initializing the second block and check how that changes the time required for cudaHostRegister. If that does not help, you would want to seek the assistance of a specialist for the operating system you are using, as cudaHostRegister is likely just a thin wrapper around OS API calls. A system trace utility can tell you which ones.

Initialization seems to be a factor:

# cat t291.cu
#include <iostream>
#include <cstdlib>   // malloc
#include <cstring>   // memset

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start=0){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
const size_t GB8 = 1024ULL*1024*1024*8;

int main(){

  int *d = (int *)malloc(GB8);
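#ifdef USE_INIT
  // Touch every byte so the pages are already faulted in before registration
  memset(d, 0, GB8);
#endif
  // Timing starts here, so the memset above is not included in the measurement
  unsigned long long dt = dtime_usec(0);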
  cudaHostRegister(d, GB8, cudaHostRegisterDefault);
  dt = dtime_usec(dt);
  std::cout << "duration: " << dt/(float)USECPSEC << "s" << std::endl;
}
# nvcc -o t291 t291.cu
# ./t291
duration: 5.18446s
# nvcc -o t291 t291.cu -DUSE_INIT
# ./t291
duration: 2.36863s
#

A typical allocator like malloc may allocate on first touch. That is, the call itself reserves a range in the process’s virtual memory map, but no physical pages back it yet. Physical pages are then supplied by the host OS page-fault mechanism when the memory is first accessed.
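A rough way to observe first-touch behavior without involving CUDA is to time two write passes over a freshly allocated buffer. The following is a minimal sketch, assuming a lazily allocating OS such as Linux and a 4096-byte page size (the default stride); touch_seconds is a hypothetical helper name, not part of any library.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: write one byte every `stride` bytes and return the
// elapsed wall-clock time in seconds. The default stride of 4096 is an
// assumed page size, so each write lands on a different page.
double touch_seconds(unsigned char* buf, std::size_t bytes,
                     std::size_t stride = 4096) {
    assert(stride > 0);
    volatile unsigned char* p = buf;  // volatile keeps the writes from being optimized away
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t off = 0; off < bytes; off += stride) {
        p[off] = 1;
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Calling touch_seconds twice on the same malloc'd buffer and comparing the two results should show whether the first pass pays a page-fault cost on your system. The exact numbers depend on the OS, the allocator, and settings such as transparent huge pages.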

However, a pinned allocation is not allowed to be “un-paged”. Therefore, registering an allocation that has unassigned pages probably requires additional steps, which would account for the longer registration time of the uninitialized block.
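If missing physical pages are indeed the cause, one thing to try is faulting the pages in yourself before registering. The following is a minimal sketch assuming Linux (it relies on sysconf and mincore); prefault and resident_pages are hypothetical helper names. It has not been benchmarked here, and it only moves the page-fault cost out of the registration call rather than removing it.

```cpp
#include <sys/mman.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: count how many pages of [addr, addr + bytes) are
// currently resident in physical memory. addr must be page-aligned.
// Returns -1 if mincore() fails.
long resident_pages(void* addr, std::size_t bytes) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    assert(reinterpret_cast<std::uintptr_t>(addr) % page == 0);
    std::vector<unsigned char> vec((bytes + page - 1) / page);
    if (mincore(addr, bytes, vec.data()) != 0) return -1;
    long count = 0;
    for (unsigned char v : vec) count += v & 1;  // low bit set means resident
    return count;
}

// Hypothetical helper: write one byte per page so the OS backs every page
// with physical memory. This overwrites data, so use it only on memory
// whose contents do not matter yet.
void prefault(void* addr, std::size_t bytes) {
    if (bytes == 0) return;
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    volatile unsigned char* p = static_cast<volatile unsigned char*>(addr);
    for (std::size_t off = 0; off < bytes; off += page) p[off] = 0;
    p[bytes - 1] = 0;  // covers the last page when addr is not page-aligned
}
```

Calling prefault(ptr, bytes) on the uninitialized block just before cudaHostRegister(ptr, bytes, cudaHostRegisterDefault) would let you check whether the registration time then matches the initialized case. resident_pages can confirm, on a page-aligned buffer, that the pages were actually faulted in.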