Very long latency (2.6 s + 2.6 s) when running CUDA/OpenCL code on a Threadripper Linux server


I recently purchased a computing cluster with 10 nodes (Threadripper 3990X / 4x RTX 2080 Super / Ubuntu 20.04 / 256 GB RAM / TRX40 Creator / CUDA 10.1 / 440.100 driver / 5.4 kernel). Everything seems to work fine, except that my CUDA program has a rather long delay when initializing the GPU as well as when freeing the device.

I saw a similar report about using an NVIDIA GPU with a Threadripper Linux server.

Using strace -tt -T -r, I found that the two ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, ...)) calls marked below, which access /dev/nvidia0, added about 2.65 s each, while my actual job only took about 100 ms, so over 90% of the running time was wasted in these two ioctl calls.
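For reference, this is roughly how I collected the trace and located the slow calls; the output file name and the mcx arguments below are just placeholders for my own test case:

$ strace -tt -T -r -o mcx_debug_log.txt ./mcx <test arguments>
$ grep '+\s*2\.6' mcx_debug_log.txt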

The same test took only 0.2 s to finish on another machine with an Intel i7-7700K / Titan V / Ubuntu 16.04.

The same latency happened with OpenCL, but no latency was seen for the CPU jobs (SSE4), so I am pretty sure it is caused by accessing /dev/nvidia*.

Can someone suggest how to debug this? My system was freshly installed by the vendor recently, and all updates have been applied.

The log below shows a 2.6 s delay at the beginning of the code, and the bottom block shows the second 2.6 s delay when releasing the device.

01:15:23.065251 (+     0.000034) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4a, 0xc0), 0x7ffe84042c10) = 0 <0.000049>
01:15:23.065312 (+     0.000060) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffe84042a50) = 0 <0.000028>
01:15:23.065353 (+     0.000040) openat(AT_FDCWD, "/dev/nvidia0", O_RDWR) = 15 <0.000012>
01:15:23.065378 (+     0.000025) fcntl(15, F_SETFD, FD_CLOEXEC) = 0 <0.000004>
01:15:23.065393 (+     0.000014) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4e, 0x38), 0x7ffe84042ba0) = 0 <0.000291>
01:15:23.065698 (+     0.000306) mmap(0x200400000, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 15, 0) = 0x200400000 <0.000011>
01:15:23.065722 (+     0.000023) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x5e, 0x28), 0x7ffe84042b70) = 0 <0.000005>
01:15:23.065739 (+     0.000016) close(15) = 0 <0.000004>
01:15:23.065757 (+     0.000017) ioctl(4, _IOC(_IOC_NONE, 0, 0x49, 0), 0x7ffe84042a60) = 0 <0.000004>
01:15:23.065772 (+     0.000015) ioctl(4, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffe84042210) = 0 <0.000295>
>here> 01:15:23.066090 (+     0.000317) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4a, 0xc0), 0x7ffe84042ab0) = 0 <2.650923>
01:15:25.717059 (+     2.650973) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffe840428f0) = 0 <0.000186>
01:15:25.717263 (+     0.000200) openat(AT_FDCWD, "/dev/nvidiactl", O_RDWR) = 15 <0.000024>
01:15:25.717307 (+     0.000043) fcntl(15, F_SETFD, FD_CLOEXEC) = 0 <0.000004>
01:15:25.717323 (+     0.000016) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4e, 0x38), 0x7ffe84042a40) = 0 <0.000056>
01:15:25.717394 (+     0.000070) mmap(0x200600000, 58720256, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 15, 0) = 0x200600000 <0.000408>
01:15:25.717817 (+     0.000422) close(15) = 0 <0.000005>
...
01:15:26.269608 (+     0.000076) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffe84043b20) = 0 <0.000019>
01:15:26.269640 (+     0.000032) munmap(0x7f6ab055d000, 4096) = 0 <0.000009>
01:15:26.269662 (+     0.000021) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffe84043b20) = 0 <0.000035>
01:15:26.269718 (+     0.000055) ioctl(4, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffe84043a00) = 0 <0.000110>
01:15:26.269843 (+     0.000125) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffe84043b10) = 0 <0.000019>
01:15:26.269876 (+     0.000032) mmap(0x200600000, 58720256, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x200600000 <0.000860>
>here> 01:15:26.270765 (+     0.000889) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffe84043b10) = 0 <2.643325>
01:15:28.914144 (+     2.643380) ioctl(4, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffe84043a10) = 0 <0.000287>
01:15:28.914450 (+     0.000303) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffe84043b20) = 0 <0.000053>
01:15:28.914515 (+     0.000065) mmap(0x200400000, 2097152, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x200400000 <0.000024>
01:15:28.914555 (+     0.000039) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffe84043b20) = 0 <0.000037>
01:15:28.914607 (+     0.000052) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffe84043d10) = 0 <0.000045>
01:15:28.914665 (+     0.000058) write(14, "\253", 1) = 1 <0.000006>

Here is the full strace output dump:

mcx_debug_log.txt (379.1 KB)

Make sure you have the persistence daemon turned on: https://docs.nvidia.com/deploy/driver-persistence/index.html

Beyond that, CUDA startup includes building a unified virtual address map that incorporates all GPU memory and all system memory. The larger the system memory and the more GPUs in the system, the longer this takes, with almost all of that time spent in single-threaded operating system calls. A high-frequency CPU will help (I recommend > 3.5 GHz base frequency), and high-throughput system memory may help.
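As a quick way to see how much of the wall-clock time ends up in those operating system calls, strace's summary mode may help (a rough sketch; ./your_cuda_app is a placeholder for whatever binary you are testing):

$ strace -c -w ./your_cuda_app

-c prints a per-syscall time/count table on exit, and -w reports wall-clock latency rather than system CPU time, so slow ioctl calls against /dev/nvidia* should dominate that table.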

For testing purposes, I currently have only 2x 2080 Super in the box.
I can confirm that nvidia-persistenced is running, and I also enabled persistence mode; see the nvidia-smi output below.

However, I still see the same latency and runtime.

fangq@moli:~/space/git/Project/github/mcx/test$ nvidia-smi
Sun Aug 16 22:51:33 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:01:00.0 Off |                  N/A |
|  0%   32C    P8     5W / 250W |     16MiB /  7981MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:21:00.0 Off |                  N/A |
|  0%   29C    P8    19W / 250W |      1MiB /  7982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2627      G   /usr/lib/xorg/Xorg                            14MiB |
+-----------------------------------------------------------------------------+
fangq@moli:~/space/git/Project/github/mcx/test$ ps aux | grep persis
nvidia-+    2034  0.0  0.0   5108  1624 ?        Ss   Aug12   0:00 /usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose

I don’t have any additional insights into this. My system only has 32 GB of system memory and one 8 GB GPU, so I do not know what overhead to expect for building the virtual memory map on a system with 256 GB of system memory. I agree that 2.6 seconds startup overhead seems excessively slow.

I notice that the nvidia-persistenced command line contains a --no-persistence-mode flag; does that mean persistence mode is not enabled?

fangq@moli:~/space/git/Project/github/mcx/test$ sudo systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; static; vendor preset: enabled)
     Active: active (running) since Wed 2020-08-12 12:19:58 PDT; 4 days ago
   Main PID: 2034 (nvidia-persiste)
      Tasks: 1 (limit: 309057)
     Memory: 1.4M
     CGroup: /system.slice/nvidia-persistenced.service
             └─2034 /usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose

Aug 12 12:19:58 moli nvidia-persistenced[2034]: Now running with user ID 123 and group ID 133
Aug 12 12:19:58 moli nvidia-persistenced[2034]: Started (2034)
Aug 12 12:19:58 moli nvidia-persistenced[2034]: device 0000:01:00.0 - registered
Aug 12 12:19:58 moli nvidia-persistenced[2034]: device 0000:21:00.0 - registered
Aug 12 12:19:58 moli nvidia-persistenced[2034]: Local RPC services initialized
Aug 12 12:19:58 moli systemd[1]: Started NVIDIA Persistence Daemon.
Aug 16 22:50:23 moli nvidia-persistenced[2034]: device 0000:01:00.0 - persistence mode enabled.
Aug 16 22:50:23 moli nvidia-persistenced[2034]: device 0000:01:00.0 - NUMA memory onlined.
Aug 16 22:50:26 moli nvidia-persistenced[2034]: device 0000:21:00.0 - persistence mode enabled.
Aug 16 22:50:26 moli nvidia-persistenced[2034]: device 0000:21:00.0 - NUMA memory onlined.

From the documentation:

Use --persistence-mode to force persistence mode on for all devices on startup.

Thanks, I manually set persistence mode to on for both GPUs; unfortunately it did not make a difference.
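For the record, setting it manually was done with something like the following (-pm 1 requires root, and -i selects the GPU index):

$ sudo nvidia-smi -i 0 -pm 1
$ sudo nvidia-smi -i 1 -pm 1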

About 3 years ago we built a GPU node with 12x 1080 Ti. Initially I also had a similar latency issue, but running nvidia-persistenced reduced it to almost nothing. It is strange that this time it does not seem to make a difference.

Is it possible to disable this virtual memory map creation, or to at least skip the host memory part?

There should definitely be a difference between running with --no-persistence-mode versus running with --persistence-mode. It seems to me you would want to double-check the persistence daemon status. Maybe monitor it with watch -n 2 systemctl status nvidia-persistenced?

The unified virtual memory space has been an essential feature of CUDA for many years, so creating the required memory map is mandatory. It shouldn’t take 2+ seconds, though.

I added --persistence-mode and an [Install] section to the service file, following this post.
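The relevant part of the edited service file now looks roughly like this (only the changed sections are shown; the [Install] target is my assumption of what the linked post suggests), followed by sudo systemctl daemon-reload and a reboot:

[Service]
ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --persistence-mode --verbose

[Install]
WantedBy=multi-user.target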

After a reboot, persistence mode appears to be on for both GPUs, but when running either the CUDA or the OpenCL code the behavior is almost exactly the same: a 2.6 s delay at the beginning and another 2.6 s at the end.

$ sudo systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; disabled; vendor preset: enabled)
     Active: active (running) since Mon 2020-08-17 07:36:56 PDT; 14min ago
    Process: 2018 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --persistence-mode --verbose (code=exited, status=0/SUCCESS)
   Main PID: 2021 (nvidia-persiste)
      Tasks: 1 (limit: 309057)
     Memory: 988.0K
     CGroup: /system.slice/nvidia-persistenced.service
             └─2021 /usr/bin/nvidia-persistenced --user nvidia-persistenced --persistence-mode --verbose

Aug 17 07:36:54 moli nvidia-persistenced[2021]: Now running with user ID 123 and group ID 133
Aug 17 07:36:54 moli nvidia-persistenced[2021]: Started (2021)
Aug 17 07:36:54 moli nvidia-persistenced[2021]: device 0000:01:00.0 - registered
Aug 17 07:36:55 moli nvidia-persistenced[2021]: device 0000:01:00.0 - persistence mode enabled.
Aug 17 07:36:55 moli nvidia-persistenced[2021]: device 0000:01:00.0 - NUMA memory onlined.
Aug 17 07:36:55 moli nvidia-persistenced[2021]: device 0000:21:00.0 - registered
Aug 17 07:36:56 moli nvidia-persistenced[2021]: device 0000:21:00.0 - persistence mode enabled.
Aug 17 07:36:56 moli nvidia-persistenced[2021]: device 0000:21:00.0 - NUMA memory onlined.
Aug 17 07:36:56 moli nvidia-persistenced[2021]: Local RPC services initialized
Aug 17 07:36:56 moli systemd[1]: Started NVIDIA Persistence Daemon.
$ nvidia-smi
Mon Aug 17 07:52:07 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:01:00.0 Off |                  N/A |
|  0%   32C    P8     5W / 250W |     16MiB /  7981MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:21:00.0 Off |                  N/A |
|  0%   29C    P8    19W / 250W |      1MiB /  7982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2650      G   /usr/lib/xorg/Xorg                            14MiB |
+-----------------------------------------------------------------------------+

Strangely, listing devices via nvidia-smi or via my program (mcx -L) prints the GPU info almost instantly, but clinfo is hit by this 2.6 s delay six times:

$ grep -B9 '+\s*2\.6' log_clinfo.txt 
07:45:49.590064 (+     0.000050) openat(AT_FDCWD, "/dev/nvidia0", O_RDWR) = 13 <0.000018>
07:45:49.590096 (+     0.000032) fcntl(13, F_SETFD, FD_CLOEXEC) = 0 <0.000004>
07:45:49.590113 (+     0.000016) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4e, 0x38), 0x7ffc7230a9f0) = 0 <0.000292>
07:45:49.590418 (+     0.000305) mmap(0x200400000, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 13, 0) = 0x200400000 <0.000022>
07:45:49.590454 (+     0.000035) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x5e, 0x28), 0x7ffc7230a9c0) = 0 <0.000007>
07:45:49.590473 (+     0.000019) close(13) = 0 <0.000004>
07:45:49.590494 (+     0.000021) ioctl(5, _IOC(_IOC_NONE, 0, 0x49, 0), 0x7ffc7230a8b0) = 0 <0.000006>
07:45:49.590513 (+     0.000018) ioctl(5, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffc7230a060) = 0 <0.000261>
07:45:49.590799 (+     0.000285) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4a, 0xc0), 0x7ffc7230a900) = 0 <2.649303>
07:45:52.240130 (+     2.649337) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc7230a740) = 0 <0.000191>
--
07:45:52.733005 (+     0.000033) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230b9f0) = 0 <0.000033>
07:45:52.733052 (+     0.000047) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230bbb0) = 0 <0.000045>
07:45:52.733110 (+     0.000058) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffc7230b9d0) = 0 <0.000014>
07:45:52.733137 (+     0.000026) munmap(0x7f868a175000, 4096) = 0 <0.000006>
07:45:52.733154 (+     0.000017) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230b9d0) = 0 <0.000028>
07:45:52.733198 (+     0.000044) ioctl(5, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc7230b8b0) = 0 <0.000094>
07:45:52.733304 (+     0.000106) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffc7230b9c0) = 0 <0.000014>
07:45:52.733331 (+     0.000026) mmap(0x200600000, 58720256, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x200600000 <0.000461>
07:45:52.733804 (+     0.000473) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230b9c0) = 0 <2.635809>
07:45:55.369639 (+     2.635838) ioctl(5, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc7230b8c0) = 0 <0.000267>
--
07:45:55.378751 (+     0.000049) openat(AT_FDCWD, "/dev/nvidia1", O_RDWR) = 13 <0.000011>
07:45:55.378776 (+     0.000024) fcntl(13, F_SETFD, FD_CLOEXEC) = 0 <0.000004>
07:45:55.378792 (+     0.000015) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4e, 0x38), 0x7ffc7230a9f0) = 0 <0.000238>
07:45:55.379042 (+     0.000250) mmap(0x200400000, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 13, 0) = 0x200400000 <0.000011>
07:45:55.379069 (+     0.000027) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x5e, 0x28), 0x7ffc7230a9c0) = 0 <0.000005>
07:45:55.379086 (+     0.000017) close(13) = 0 <0.000004>
07:45:55.379104 (+     0.000017) ioctl(5, _IOC(_IOC_NONE, 0, 0x49, 0), 0x7ffc7230a8b0) = 0 <0.000004>
07:45:55.379120 (+     0.000016) ioctl(5, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffc7230a060) = 0 <0.000293>
07:45:55.379435 (+     0.000315) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4a, 0xc0), 0x7ffc7230a900) = 0 <2.644974>
07:45:58.024453 (+     2.645021) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc7230a740) = 0 <0.000126>
--
07:45:58.385133 (+     0.000044) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230b9f0) = 0 <0.000039>
07:45:58.385189 (+     0.000056) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230bbb0) = 0 <0.000056>
07:45:58.385262 (+     0.000072) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffc7230b9d0) = 0 <0.000018>
07:45:58.385295 (+     0.000033) munmap(0x7f868a175000, 4096) = 0 <0.000009>
07:45:58.385318 (+     0.000022) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230b9d0) = 0 <0.000033>
07:45:58.385372 (+     0.000053) ioctl(5, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc7230b8b0) = 0 <0.000102>
07:45:58.385491 (+     0.000118) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffc7230b9c0) = 0 <0.000019>
07:45:58.385525 (+     0.000034) mmap(0x200600000, 58720256, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x200600000 <0.000695>
07:45:58.386236 (+     0.000711) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230b9c0) = 0 <2.644142>
07:46:01.030418 (+     2.644183) ioctl(5, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc7230b8c0) = 0 <0.000258>
--
07:46:01.101095 (+     0.000046) openat(AT_FDCWD, "/dev/nvidia0", O_RDWR) = 13 <0.000015>
07:46:01.101126 (+     0.000031) fcntl(13, F_SETFD, FD_CLOEXEC) = 0 <0.000004>
07:46:01.101144 (+     0.000017) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4e, 0x38), 0x7ffc7230ae70) = 0 <0.000119>
07:46:01.101278 (+     0.000133) mmap(0x200400000, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 13, 0) = 0x200400000 <0.000018>
07:46:01.101311 (+     0.000033) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x5e, 0x28), 0x7ffc7230ae40) = 0 <0.000007>
07:46:01.101332 (+     0.000020) close(13) = 0 <0.000004>
07:46:01.101354 (+     0.000021) ioctl(5, _IOC(_IOC_NONE, 0, 0x49, 0), 0x7ffc7230ad30) = 0 <0.000005>
07:46:01.101374 (+     0.000019) ioctl(5, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffc7230a4e0) = 0 <0.000068>
07:46:01.101466 (+     0.000091) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4a, 0xc0), 0x7ffc7230ad80) = 0 <2.647105>
07:46:03.748613 (+     2.647153) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc7230abc0) = 0 <0.000104>
--
07:46:04.119888 (+     0.000055) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230be70) = 0 <0.000049>
07:46:04.119954 (+     0.000066) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230c030) = 0 <0.000073>
07:46:04.120046 (+     0.000091) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffc7230be50) = 0 <0.000022>
07:46:04.120082 (+     0.000036) munmap(0x7f868a175000, 4096) = 0 <0.000013>
07:46:04.120110 (+     0.000027) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230be50) = 0 <0.000042>
07:46:04.120177 (+     0.000066) ioctl(5, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc7230bd30) = 0 <0.000119>
07:46:04.120312 (+     0.000135) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffc7230be40) = 0 <0.000021>
07:46:04.120348 (+     0.000036) mmap(0x200600000, 58720256, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x200600000 <0.001037>
07:46:04.121403 (+     0.001054) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230be40) = 0 <2.639314>
07:46:06.760744 (+     2.639344) ioctl(5, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc7230bd40) = 0 <0.000265>

I found another thread reporting a similar issue, albeit with a different type of delay. What is the status of internal bug #2010268?

Bugs at NVIDIA are always confidential, because many bug reports include proprietary information of one kind or another. Consequently, only the filer and the relevant NVIDIA personnel get to look at bug reports.

Your issue may be identical to an existing bug, or it may be an as-yet unreported issue. The only sure way to find out is to file a bug report yourself. Once NVIDIA engineering have a look at it, they will determine whether it is a duplicate or a new and separate issue.

Here is the output of nvidia-bug-report:

nvidia-bug-report.log.gz (464.2 KB)

Here is the strace output of my CUDA program mcx (GitHub - fangq/mcx: Monte Carlo eXtreme (MCX) - GPU-accelerated photon transport simulator). There are two 2.6 s delays in the execution (run grep '+\s*2\.6' mcx_debug_log.txt to see them), while the actual workload takes only 100 ms.
mcx_debug_log.txt.gz (24.7 KB)

The same latency also happens with OpenCL-based programs. Here is the strace output from running clinfo; there are a total of six 2.6 s delays in the entire call (run grep '+\s*2\.6' log_clinfo.txt to see them).

log_clinfo.txt.gz (131.6 KB)

The only sure way to find out is to file a bug report yourself. Once NVIDIA engineering have a look at it, they will determine whether it is a duplicate or a new and separate issue.

Done, submitted (Bug# 3099084).

Just want to follow up on this issue: I heard back from NVIDIA's engineers that it is now fixed in the new driver release 450.66 (currently in focal-proposed), and I am very glad to confirm that the latency issue is gone.

Both the CUDA and OpenCL codes now start and end without that latency. If you run into the same issue (Threadripper/Ryzen CPU + NVIDIA GPU), wait for this driver!
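If you want to try it before it reaches the regular focal-updates pocket, something along these lines should work; the package name nvidia-driver-450 and the use of the -proposed pocket are my assumptions, so double-check before enabling -proposed on a production node:

$ sudo sh -c 'echo "deb http://archive.ubuntu.com/ubuntu focal-proposed restricted main universe multiverse" > /etc/apt/sources.list.d/focal-proposed.list'
$ sudo apt update
$ sudo apt install -t focal-proposed nvidia-driver-450

Otherwise, a normal apt upgrade should pick up 450.66 once it is promoted to focal-updates.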