I successfully decode a JPEG with nvJPEG, so the RGBA buffer is now in GPU memory. I then want to copy this data to main memory, so I tried two ways of getting pinned host memory, malloc + cudaHostRegister and cudaHostAlloc, each followed by a cudaMemcpy from device to host. I ran this code on 3 platforms. On 2 of them malloc + cudaHostRegister performs better than cudaHostAlloc, but on the remaining platform cudaHostRegister blocks for a very long time. The strace output looks like this:
09:18:54.603254 open("/proc/driver/nvidia/params", O_RDONLY) = 36
09:18:54.603337 fstat(36, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
09:18:54.603382 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa485d0f000
09:18:54.603424 read(36, "Mobile: 4294967295\nResmanDebugLe"..., 1024) = 641
09:18:54.603491 close(36) = 0
09:18:54.603533 munmap(0x7fa485d0f000, 4096) = 0
09:18:54.603575 stat("/dev/nvidiactl", {st_mode=S_IFCHR|0666, st_rdev=makedev(195, 255), ...}) = 0
09:18:54.603628 open("/dev/nvidiactl", O_RDWR) = 36
09:18:54.603677 fcntl(36, F_SETFD, FD_CLOEXEC) = 0
09:18:54.603714 ioctl(6, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x27, 0x38), 0x7fff772ca790) = 0
09:19:22.366438 close(36) = 0
09:19:22.366567 ioctl(4, _IOC(0, 0, 0x21, 0), 0x7fff772ca300) = 0
09:19:22.368054 ioctl(4, _IOC(0, 0, 0x21, 0), 0x7fff772ca300) = 0
09:19:22.372559 ioctl(4, _IOC(0, 0, 0x21, 0), 0x7fff772ca300) = 0
09:19:22.373936 getrusage(RUSAGE_SELF, {ru_utime={tv_sec=0, tv_usec=368000}, ru_stime={tv_sec=28, tv_usec=560000}, ...}) = 0
09:19:22.374005 times({tms_utime=36, tms_stime=2856, tms_cutime=0, tms_cstime=0}) = 3332784180
09:19:22.374054 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=28, tv_nsec=918960203}) = 0
09:19:22.374105 write(1, "allocMemory cost 27771 ms\n", 27allocMemory cost 27771 ms
) = 27
09:19:22.374195 getrusage(RUSAGE_SELF, {ru_utime={tv_sec=0, tv_usec=368000}, ru_stime={tv_sec=28, tv_usec=560000}, ...}) = 0
09:19:22.374245 clock_gettime(CLOCK_THREAD_CPUTIME_ID, {tv_sec=28, tv_nsec=919010234}) = 0
09:19:22.374285 times({tms_utime=36, tms_stime=2856, tms_cutime=0, tms_cstime=0}) = 3332784180
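In this trace, almost all of the 27771 ms falls between the ioctl logged at 09:18:54.603714 and the next call at 09:19:22.366438, and getrusage reports about 28.5 s of system time. I don't know why cudaHostRegister performs better than cudaHostAlloc, or why cudaHostRegister blocks for so long on the last platform. I have tested with CUDA 10.1 and 10.2, and both show the same problem.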
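
For reference, the comparison described above can be sketched roughly as follows. This is a minimal sketch and not the original code: the buffer size, the helper names (`check`, `ms_since`, `time_register_path`, `time_hostalloc_path`), the `Timing` struct and the timing method are all assumptions for illustration, and a plain cudaMalloc buffer stands in for the nvJPEG output.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

using Clock = std::chrono::steady_clock;

struct Timing {
    bool ok;         // false if malloc or any CUDA call failed
    double pin_ms;   // time to obtain pinned host memory
    double copy_ms;  // time for the device-to-host cudaMemcpy
};

static double ms_since(Clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
}

static bool check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Path A: malloc + cudaHostRegister, then copy device -> host.
Timing time_register_path(const void* d_src, size_t bytes) {
    Timing t = {false, 0.0, 0.0};
    Clock::time_point t0 = Clock::now();
    void* h = std::malloc(bytes);
    if (h == nullptr) return t;
    if (check(cudaHostRegister(h, bytes, cudaHostRegisterDefault), "cudaHostRegister")) {
        t.pin_ms = ms_since(t0);
        t0 = Clock::now();
        t.ok = check(cudaMemcpy(h, d_src, bytes, cudaMemcpyDeviceToHost), "cudaMemcpy");
        t.copy_ms = ms_since(t0);
        check(cudaHostUnregister(h), "cudaHostUnregister");
    }
    std::free(h);
    return t;
}

// Path B: cudaHostAlloc, then copy device -> host.
Timing time_hostalloc_path(const void* d_src, size_t bytes) {
    Timing t = {false, 0.0, 0.0};
    void* h = nullptr;
    Clock::time_point t0 = Clock::now();
    if (check(cudaHostAlloc(&h, bytes, cudaHostAllocDefault), "cudaHostAlloc")) {
        t.pin_ms = ms_since(t0);
        t0 = Clock::now();
        t.ok = check(cudaMemcpy(h, d_src, bytes, cudaMemcpyDeviceToHost), "cudaMemcpy");
        t.copy_ms = ms_since(t0);
        check(cudaFreeHost(h), "cudaFreeHost");
    }
    return t;
}

int main() {
    // Placeholder size: one 1920x1080 RGBA frame (assumed, not from the post).
    const size_t bytes = static_cast<size_t>(1920) * 1080 * 4;

    // Force context creation first so it is not counted in either measurement.
    if (!check(cudaFree(nullptr), "context init")) return 1;

    void* d_src = nullptr;  // stands in for the decoded RGBA buffer on the GPU
    if (!check(cudaMalloc(&d_src, bytes), "cudaMalloc")) return 1;
    check(cudaMemset(d_src, 0, bytes), "cudaMemset");

    Timing a = time_register_path(d_src, bytes);
    Timing b = time_hostalloc_path(d_src, bytes);
    std::printf("malloc+cudaHostRegister: pin %.3f ms, copy %.3f ms\n", a.pin_ms, a.copy_ms);
    std::printf("cudaHostAlloc:           pin %.3f ms, copy %.3f ms\n", b.pin_ms, b.copy_ms);

    cudaFree(d_src);
    return (a.ok && b.ok) ? 0 : 1;
}
```

This should build with something like `nvcc pin_compare.cu -o pin_compare` (the file name is hypothetical). Timing the pinning step separately from the copy makes it easier to tell on each platform whether the stall is in cudaHostRegister itself or in the transfer.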