cudaHostRegister blocks for a very long time

I used nvjpeg to decode a JPEG successfully, so the RGBA buffer is now in GPU memory, and I want to copy that data back to host memory. To do this I pin host memory either with malloc + cudaHostRegister or with cudaHostAlloc, then cudaMemcpy from device to host. I ran this code on 3 platforms. On 2 of them malloc + cudaHostRegister has better performance than cudaHostAlloc, but on the remaining platform cudaHostRegister blocks for a very long time. The strace output looks like this:
09:18:54.603254 open("/proc/driver/nvidia/params", O_RDONLY) = 36
09:18:54.603337 fstat(36, {st_mode=S_IFREG|0444, st_size=0, …}) = 0
09:18:54.603382 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa485d0f000
09:18:54.603424 read(36, "Mobile: 4294967295\nResmanDebugLe"..., 1024) = 641
09:18:54.603491 close(36) = 0
09:18:54.603533 munmap(0x7fa485d0f000, 4096) = 0
09:18:54.603575 stat("/dev/nvidiactl", {st_mode=S_IFCHR|0666, st_rdev=makedev(195, 255), …}) = 0
09:18:54.603628 open("/dev/nvidiactl", O_RDWR) = 36
09:18:54.603677 fcntl(36, F_SETFD, FD_CLOEXEC) = 0
09:18:54.603714 ioctl(6, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x27, 0x38), 0x7fff772ca790) = 0
09:19:22.366438 close(36) = 0
09:19:22.366567 ioctl(4, _IOC(0, 0, 0x21, 0), 0x7fff772ca300) = 0
09:19:22.368054 ioctl(4, _IOC(0, 0, 0x21, 0), 0x7fff772ca300) = 0
09:19:22.372559 ioctl(4, _IOC(0, 0, 0x21, 0), 0x7fff772ca300) = 0
09:19:22.373936 getrusage(RUSAGE_SELF, {ru_utime={tv_sec=0, tv_usec=368000}, ru_stime={tv_sec=28, tv_usec=560000}, …}) = 0
09:19:22.374005 times({tms_utime=36, tms_stime=2856, tms_cutime=0, tms_cstime=0}) = 3332784180
09:19:22.374054 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=28, tv_nsec=918960203}) = 0
09:19:22.374105 write(1, "allocMemory cost 27771 ms\n", 27) = 27
09:19:22.374195 getrusage(RUSAGE_SELF, {ru_utime={tv_sec=0, tv_usec=368000}, ru_stime={tv_sec=28, tv_usec=560000}, …}) = 0
09:19:22.374245 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=28, tv_nsec=919010234}) = 0
09:19:22.374285 times({tms_utime=36, tms_stime=2856, tms_cutime=0, tms_cstime=0}) = 3332784180

I don't understand why malloc + cudaHostRegister has better performance than cudaHostAlloc, or why cudaHostRegister blocks for so long on the last platform. I have tested with CUDA 10.1 and 10.2, and both show this same problem.
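
For reference, here is a minimal sketch of the two paths being compared. Error checking and the nvjpeg decode itself are omitted; dev_rgba and num_bytes are placeholders for the decoded image already in device memory and its size, and the function names are just illustrative:

```cpp
#include <cuda_runtime.h>
#include <cstdlib>

// Path 1: pageable allocation, pinned after the fact with cudaHostRegister.
void copy_with_host_register(const void* dev_rgba, size_t num_bytes) {
    void* host_buf = malloc(num_bytes);
    cudaHostRegister(host_buf, num_bytes, cudaHostRegisterDefault);
    cudaMemcpy(host_buf, dev_rgba, num_bytes, cudaMemcpyDeviceToHost);
    cudaHostUnregister(host_buf);
    free(host_buf);
}

// Path 2: memory allocated already pinned with cudaHostAlloc.
void copy_with_host_alloc(const void* dev_rgba, size_t num_bytes) {
    void* host_buf = nullptr;
    cudaHostAlloc(&host_buf, num_bytes, cudaHostAllocDefault);
    cudaMemcpy(host_buf, dev_rgba, num_bytes, cudaMemcpyDeviceToHost);
    cudaFreeHost(host_buf);
}
```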

These CUDA API calls spend pretty much their entire time in OS API calls. If you see performance differences, they are likely down to (1) a different OS, (2) a different OS version, (3) a different OS configuration, (4) a different hardware platform, or (5) a different hardware platform configuration. A standard debugging approach is to minimize differences between platforms until they deliver (almost) identical performance, then perform controlled experiments in which only one factor changes in any given step. That should reveal the root cause.

No information was provided regarding any of the items (1) through (5). In terms of (4), the performance of the OS API calls involved is typically a function of CPU single-thread performance and system memory performance. A reasonable hypothesis is that the time for OS API calls for memory management correlates with the total amount of system memory, the total number of allocations, and the utilization rate of memory.
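
One possible controlled experiment along those lines is to time the pinning call in isolation, with the same binary on each platform, across a range of sizes. The sizes and loop below are only an illustrative sketch, not a prescribed methodology:

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>

int main() {
    cudaFree(0);  // force CUDA context creation so it is not counted in the timings

    for (size_t mb : {16, 64, 256, 1024}) {
        size_t bytes = mb << 20;
        void* p = malloc(bytes);
        auto t0 = std::chrono::steady_clock::now();
        cudaError_t err = cudaHostRegister(p, bytes, cudaHostRegisterDefault);
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        printf("cudaHostRegister %4zu MB: %8.1f ms (%s)\n",
               mb, ms, cudaGetErrorString(err));
        if (err == cudaSuccess) cudaHostUnregister(p);
        free(p);
    }
    return 0;
}
```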

Conceptually, cudaHostAlloc requires memory allocation plus registration, which suggests that registration alone should be cheaper (the allocation and its associated cost were already paid at some earlier point). But I have not looked into the details of the underlying OS API calls in a decade.
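
A sketch of one way to probe that idea: register a freshly malloc'd buffer (whose pages have likely not been touched yet) versus a buffer that has been written once, so that the page-fault cost and the registration cost can be separated. This is only an illustrative experiment under that assumption, not a statement about what the driver does internally:

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstring>
#include <cstdlib>

// Time only the cudaHostRegister call for the given buffer, in milliseconds.
static double time_register(void* p, size_t bytes) {
    auto t0 = std::chrono::steady_clock::now();
    cudaHostRegister(p, bytes, cudaHostRegisterDefault);
    auto t1 = std::chrono::steady_clock::now();
    cudaHostUnregister(p);
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const size_t bytes = 256ull << 20;  // 256 MB, an arbitrary test size
    cudaFree(0);                        // create the CUDA context up front

    void* untouched = malloc(bytes);    // pages likely not yet faulted in
    printf("register untouched buffer: %.1f ms\n", time_register(untouched, bytes));
    free(untouched);

    void* touched = malloc(bytes);
    memset(touched, 1, bytes);          // fault the pages in before registering
    printf("register touched buffer:   %.1f ms\n", time_register(touched, bytes));
    free(touched);
    return 0;
}
```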