Very long latency (2.6 s + 2.6 s) when running CUDA/OpenCL code on a Threadripper Linux server


I recently purchased a computing cluster with 10 nodes (Threadripper 3990X / 4x RTX 2080 Super / Ubuntu 20.04 / 256 GB RAM / TRX40 Creator / CUDA 10.1 / 440.100 driver / 5.4 kernel). Everything seems to work fine, except that my CUDA program has a rather long delay when initializing the GPU as well as when freeing the device.

I saw a similar report about using an NVIDIA GPU with a Threadripper Linux server.

Using strace -tt -T -r, I found that the two ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, ...)) calls marked below, which access /dev/nvidia0, added about 2.65 s each, while my actual job only took about 100 ms, so over 90% of the running time was wasted in these two ioctl calls.
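For reference, this is roughly how I collected the trace and located the slow calls; the output file name and the mcx arguments below are just placeholders for my own test case:

$ strace -tt -T -r -o mcx_debug_log.txt ./mcx <test arguments>
$ grep '+\s*2\.6' mcx_debug_log.txt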

The same test took only 0.2 s to finish on another machine with an Intel i7-7700K / Titan V / Ubuntu 16.04.

The same latency happened with OpenCL, but no latency was seen for the CPU jobs (SSE4), so I am pretty sure it is caused by accessing /dev/nvidia*.

Can someone suggest how to debug this? My system was freshly installed by the vendor recently, and all updates have been applied.

The log below shows a 2.6 s delay at the beginning of the code, and the bottom block shows the second 2.6 s delay when releasing the device.

01:15:23.065251 (+     0.000034) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4a, 0xc0), 0x7ffe84042c10) = 0 <0.000049>
01:15:23.065312 (+     0.000060) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffe84042a50) = 0 <0.000028>
01:15:23.065353 (+     0.000040) openat(AT_FDCWD, "/dev/nvidia0", O_RDWR) = 15 <0.000012>
01:15:23.065378 (+     0.000025) fcntl(15, F_SETFD, FD_CLOEXEC) = 0 <0.000004>
01:15:23.065393 (+     0.000014) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4e, 0x38), 0x7ffe84042ba0) = 0 <0.000291>
01:15:23.065698 (+     0.000306) mmap(0x200400000, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 15, 0) = 0x200400000 <0.000011>
01:15:23.065722 (+     0.000023) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x5e, 0x28), 0x7ffe84042b70) = 0 <0.000005>
01:15:23.065739 (+     0.000016) close(15) = 0 <0.000004>
01:15:23.065757 (+     0.000017) ioctl(4, _IOC(_IOC_NONE, 0, 0x49, 0), 0x7ffe84042a60) = 0 <0.000004>
01:15:23.065772 (+     0.000015) ioctl(4, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffe84042210) = 0 <0.000295>
>here> 01:15:23.066090 (+     0.000317) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4a, 0xc0), 0x7ffe84042ab0) = 0 <2.650923>
01:15:25.717059 (+     2.650973) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffe840428f0) = 0 <0.000186>
01:15:25.717263 (+     0.000200) openat(AT_FDCWD, "/dev/nvidiactl", O_RDWR) = 15 <0.000024>
01:15:25.717307 (+     0.000043) fcntl(15, F_SETFD, FD_CLOEXEC) = 0 <0.000004>
01:15:25.717323 (+     0.000016) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4e, 0x38), 0x7ffe84042a40) = 0 <0.000056>
01:15:25.717394 (+     0.000070) mmap(0x200600000, 58720256, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 15, 0) = 0x200600000 <0.000408>
01:15:25.717817 (+     0.000422) close(15) = 0 <0.000005>
...
01:15:26.269608 (+     0.000076) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffe84043b20) = 0 <0.000019>
01:15:26.269640 (+     0.000032) munmap(0x7f6ab055d000, 4096) = 0 <0.000009>
01:15:26.269662 (+     0.000021) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffe84043b20) = 0 <0.000035>
01:15:26.269718 (+     0.000055) ioctl(4, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffe84043a00) = 0 <0.000110>
01:15:26.269843 (+     0.000125) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffe84043b10) = 0 <0.000019>
01:15:26.269876 (+     0.000032) mmap(0x200600000, 58720256, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x200600000 <0.000860>
>here> 01:15:26.270765 (+     0.000889) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffe84043b10) = 0 <2.643325>
01:15:28.914144 (+     2.643380) ioctl(4, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffe84043a10) = 0 <0.000287>
01:15:28.914450 (+     0.000303) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffe84043b20) = 0 <0.000053>
01:15:28.914515 (+     0.000065) mmap(0x200400000, 2097152, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x200400000 <0.000024>
01:15:28.914555 (+     0.000039) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffe84043b20) = 0 <0.000037>
01:15:28.914607 (+     0.000052) ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffe84043d10) = 0 <0.000045>
01:15:28.914665 (+     0.000058) write(14, "\253", 1) = 1 <0.000006>

Here is the full strace output dump:

mcx_debug_log.txt (379.1 KB)

Make sure you have the persistence daemon turned on: https://docs.nvidia.com/deploy/driver-persistence/index.html

Beyond that, CUDA startup includes building a unified virtual address map that incorporates all GPU memory and all system memory. The larger the system memory and the more GPUs in the system, the longer this takes, with almost all of that time spent in single-threaded operating system calls. A high-frequency CPU will help (I recommend > 3.5 GHz base frequency), and high-throughput system memory may help.
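As a quick way to see how much of the wall-clock time ends up in those operating system calls, strace's summary mode may help (a rough sketch; ./your_cuda_app is a placeholder for whatever binary you are testing):

$ strace -c -w ./your_cuda_app

-c prints a per-syscall time/count table on exit, and -w reports wall-clock latency rather than system CPU time, so slow ioctl calls against /dev/nvidia* should dominate that table.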

For testing purposes, I currently have only 2x 2080 Super in the box.
I can confirm that nvidia-persistenced is running, and I also enabled persistence mode; see the nvidia-smi output below.

However, I still see the same latency and runtime.

fangq@moli:~/space/git/Project/github/mcx/test$ nvidia-smi
Sun Aug 16 22:51:33 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:01:00.0 Off |                  N/A |
|  0%   32C    P8     5W / 250W |     16MiB /  7981MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:21:00.0 Off |                  N/A |
|  0%   29C    P8    19W / 250W |      1MiB /  7982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2627      G   /usr/lib/xorg/Xorg                            14MiB |
+-----------------------------------------------------------------------------+
fangq@moli:~/space/git/Project/github/mcx/test$ ps aux | grep persis
nvidia-+    2034  0.0  0.0   5108  1624 ?        Ss   Aug12   0:00 /usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose

I don’t have any additional insights into this. My system only has 32 GB of system memory and one 8 GB GPU, so I do not know what overhead to expect for building the virtual memory map on a system with 256 GB of system memory. I agree that 2.6 seconds startup overhead seems excessively slow.

I notice that the nvidia-persistenced command line contains a --no-persistence-mode flag; does that mean persistence mode is not enabled?

fangq@moli:~/space/git/Project/github/mcx/test$ sudo systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; static; vendor preset: enabled)
     Active: active (running) since Wed 2020-08-12 12:19:58 PDT; 4 days ago
   Main PID: 2034 (nvidia-persiste)
      Tasks: 1 (limit: 309057)
     Memory: 1.4M
     CGroup: /system.slice/nvidia-persistenced.service
             └─2034 /usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose

Aug 12 12:19:58 moli nvidia-persistenced[2034]: Now running with user ID 123 and group ID 133
Aug 12 12:19:58 moli nvidia-persistenced[2034]: Started (2034)
Aug 12 12:19:58 moli nvidia-persistenced[2034]: device 0000:01:00.0 - registered
Aug 12 12:19:58 moli nvidia-persistenced[2034]: device 0000:21:00.0 - registered
Aug 12 12:19:58 moli nvidia-persistenced[2034]: Local RPC services initialized
Aug 12 12:19:58 moli systemd[1]: Started NVIDIA Persistence Daemon.
Aug 16 22:50:23 moli nvidia-persistenced[2034]: device 0000:01:00.0 - persistence mode enabled.
Aug 16 22:50:23 moli nvidia-persistenced[2034]: device 0000:01:00.0 - NUMA memory onlined.
Aug 16 22:50:26 moli nvidia-persistenced[2034]: device 0000:21:00.0 - persistence mode enabled.
Aug 16 22:50:26 moli nvidia-persistenced[2034]: device 0000:21:00.0 - NUMA memory onlined.

From the documentation:

Use --persistence-mode to force persistence mode on for all devices on startup.

Thanks, I manually set persistence mode to on for both GPUs; unfortunately it did not make a difference.
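For the record, setting it manually was done with something like the following (-pm 1 requires root, and -i selects the GPU index):

$ sudo nvidia-smi -i 0 -pm 1
$ sudo nvidia-smi -i 1 -pm 1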

About 3 years ago we built a GPU node with 12x 1080 Ti. Initially I also had a similar latency issue, but running nvidia-persistenced reduced it to almost nothing. It is strange that this time it does not seem to make a difference.

Is it possible to disable this virtual memory map creation, or to at least skip the host memory part?

There should definitely be a difference between running with --no-persistence-mode versus running with --persistence-mode. It seems to me you would want to double-check the persistence daemon status. Maybe monitor it with watch -n 2 systemctl status nvidia-persistenced?

The unified virtual memory space has been an essential feature of CUDA for many years, so creating the required memory map is mandatory. It shouldn’t take 2+ seconds, though.

I added --persistence-mode and an [Install] section to the service file, following this post.
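The relevant part of the edited service file now looks roughly like this (only the changed sections are shown; the [Install] target is my assumption of what the linked post suggests), followed by sudo systemctl daemon-reload and a reboot:

[Service]
ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --persistence-mode --verbose

[Install]
WantedBy=multi-user.target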

After a reboot, persistence mode appears to be on for both GPUs, but when running either the CUDA or the OpenCL code the behavior is almost exactly the same: a 2.6 s delay at the beginning and another 2.6 s at the end.

$ sudo systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; disabled; vendor preset: enabled)
     Active: active (running) since Mon 2020-08-17 07:36:56 PDT; 14min ago
    Process: 2018 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --persistence-mode --verbose (code=exited, status=0/SUCCESS)
   Main PID: 2021 (nvidia-persiste)
      Tasks: 1 (limit: 309057)
     Memory: 988.0K
     CGroup: /system.slice/nvidia-persistenced.service
             └─2021 /usr/bin/nvidia-persistenced --user nvidia-persistenced --persistence-mode --verbose

Aug 17 07:36:54 moli nvidia-persistenced[2021]: Now running with user ID 123 and group ID 133
Aug 17 07:36:54 moli nvidia-persistenced[2021]: Started (2021)
Aug 17 07:36:54 moli nvidia-persistenced[2021]: device 0000:01:00.0 - registered
Aug 17 07:36:55 moli nvidia-persistenced[2021]: device 0000:01:00.0 - persistence mode enabled.
Aug 17 07:36:55 moli nvidia-persistenced[2021]: device 0000:01:00.0 - NUMA memory onlined.
Aug 17 07:36:55 moli nvidia-persistenced[2021]: device 0000:21:00.0 - registered
Aug 17 07:36:56 moli nvidia-persistenced[2021]: device 0000:21:00.0 - persistence mode enabled.
Aug 17 07:36:56 moli nvidia-persistenced[2021]: device 0000:21:00.0 - NUMA memory onlined.
Aug 17 07:36:56 moli nvidia-persistenced[2021]: Local RPC services initialized
Aug 17 07:36:56 moli systemd[1]: Started NVIDIA Persistence Daemon.
$ nvidia-smi
Mon Aug 17 07:52:07 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:01:00.0 Off |                  N/A |
|  0%   32C    P8     5W / 250W |     16MiB /  7981MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:21:00.0 Off |                  N/A |
|  0%   29C    P8    19W / 250W |      1MiB /  7982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2650      G   /usr/lib/xorg/Xorg                            14MiB |
+-----------------------------------------------------------------------------+

Strangely, listing devices via nvidia-smi or via my program (mcx -L) prints the GPU info almost instantly, but clinfo is hit by this 2.6 s delay six times:

$ grep -B9 '+\s*2\.6' log_clinfo.txt 
07:45:49.590064 (+     0.000050) openat(AT_FDCWD, "/dev/nvidia0", O_RDWR) = 13 <0.000018>
07:45:49.590096 (+     0.000032) fcntl(13, F_SETFD, FD_CLOEXEC) = 0 <0.000004>
07:45:49.590113 (+     0.000016) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4e, 0x38), 0x7ffc7230a9f0) = 0 <0.000292>
07:45:49.590418 (+     0.000305) mmap(0x200400000, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 13, 0) = 0x200400000 <0.000022>
07:45:49.590454 (+     0.000035) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x5e, 0x28), 0x7ffc7230a9c0) = 0 <0.000007>
07:45:49.590473 (+     0.000019) close(13) = 0 <0.000004>
07:45:49.590494 (+     0.000021) ioctl(5, _IOC(_IOC_NONE, 0, 0x49, 0), 0x7ffc7230a8b0) = 0 <0.000006>
07:45:49.590513 (+     0.000018) ioctl(5, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffc7230a060) = 0 <0.000261>
07:45:49.590799 (+     0.000285) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4a, 0xc0), 0x7ffc7230a900) = 0 <2.649303>
07:45:52.240130 (+     2.649337) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc7230a740) = 0 <0.000191>
--
07:45:52.733005 (+     0.000033) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230b9f0) = 0 <0.000033>
07:45:52.733052 (+     0.000047) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230bbb0) = 0 <0.000045>
07:45:52.733110 (+     0.000058) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffc7230b9d0) = 0 <0.000014>
07:45:52.733137 (+     0.000026) munmap(0x7f868a175000, 4096) = 0 <0.000006>
07:45:52.733154 (+     0.000017) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230b9d0) = 0 <0.000028>
07:45:52.733198 (+     0.000044) ioctl(5, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc7230b8b0) = 0 <0.000094>
07:45:52.733304 (+     0.000106) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffc7230b9c0) = 0 <0.000014>
07:45:52.733331 (+     0.000026) mmap(0x200600000, 58720256, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x200600000 <0.000461>
07:45:52.733804 (+     0.000473) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230b9c0) = 0 <2.635809>
07:45:55.369639 (+     2.635838) ioctl(5, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc7230b8c0) = 0 <0.000267>
--
07:45:55.378751 (+     0.000049) openat(AT_FDCWD, "/dev/nvidia1", O_RDWR) = 13 <0.000011>
07:45:55.378776 (+     0.000024) fcntl(13, F_SETFD, FD_CLOEXEC) = 0 <0.000004>
07:45:55.378792 (+     0.000015) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4e, 0x38), 0x7ffc7230a9f0) = 0 <0.000238>
07:45:55.379042 (+     0.000250) mmap(0x200400000, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 13, 0) = 0x200400000 <0.000011>
07:45:55.379069 (+     0.000027) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x5e, 0x28), 0x7ffc7230a9c0) = 0 <0.000005>
07:45:55.379086 (+     0.000017) close(13) = 0 <0.000004>
07:45:55.379104 (+     0.000017) ioctl(5, _IOC(_IOC_NONE, 0, 0x49, 0), 0x7ffc7230a8b0) = 0 <0.000004>
07:45:55.379120 (+     0.000016) ioctl(5, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffc7230a060) = 0 <0.000293>
07:45:55.379435 (+     0.000315) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4a, 0xc0), 0x7ffc7230a900) = 0 <2.644974>
07:45:58.024453 (+     2.645021) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc7230a740) = 0 <0.000126>
--
07:45:58.385133 (+     0.000044) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230b9f0) = 0 <0.000039>
07:45:58.385189 (+     0.000056) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230bbb0) = 0 <0.000056>
07:45:58.385262 (+     0.000072) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffc7230b9d0) = 0 <0.000018>
07:45:58.385295 (+     0.000033) munmap(0x7f868a175000, 4096) = 0 <0.000009>
07:45:58.385318 (+     0.000022) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230b9d0) = 0 <0.000033>
07:45:58.385372 (+     0.000053) ioctl(5, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc7230b8b0) = 0 <0.000102>
07:45:58.385491 (+     0.000118) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffc7230b9c0) = 0 <0.000019>
07:45:58.385525 (+     0.000034) mmap(0x200600000, 58720256, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x200600000 <0.000695>
07:45:58.386236 (+     0.000711) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230b9c0) = 0 <2.644142>
07:46:01.030418 (+     2.644183) ioctl(5, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc7230b8c0) = 0 <0.000258>
--
07:46:01.101095 (+     0.000046) openat(AT_FDCWD, "/dev/nvidia0", O_RDWR) = 13 <0.000015>
07:46:01.101126 (+     0.000031) fcntl(13, F_SETFD, FD_CLOEXEC) = 0 <0.000004>
07:46:01.101144 (+     0.000017) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4e, 0x38), 0x7ffc7230ae70) = 0 <0.000119>
07:46:01.101278 (+     0.000133) mmap(0x200400000, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 13, 0) = 0x200400000 <0.000018>
07:46:01.101311 (+     0.000033) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x5e, 0x28), 0x7ffc7230ae40) = 0 <0.000007>
07:46:01.101332 (+     0.000020) close(13) = 0 <0.000004>
07:46:01.101354 (+     0.000021) ioctl(5, _IOC(_IOC_NONE, 0, 0x49, 0), 0x7ffc7230ad30) = 0 <0.000005>
07:46:01.101374 (+     0.000019) ioctl(5, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffc7230a4e0) = 0 <0.000068>
07:46:01.101466 (+     0.000091) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4a, 0xc0), 0x7ffc7230ad80) = 0 <2.647105>
07:46:03.748613 (+     2.647153) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc7230abc0) = 0 <0.000104>
--
07:46:04.119888 (+     0.000055) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230be70) = 0 <0.000049>
07:46:04.119954 (+     0.000066) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230c030) = 0 <0.000073>
07:46:04.120046 (+     0.000091) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffc7230be50) = 0 <0.000022>
07:46:04.120082 (+     0.000036) munmap(0x7f868a175000, 4096) = 0 <0.000013>
07:46:04.120110 (+     0.000027) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230be50) = 0 <0.000042>
07:46:04.120177 (+     0.000066) ioctl(5, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc7230bd30) = 0 <0.000119>
07:46:04.120312 (+     0.000135) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffc7230be40) = 0 <0.000021>
07:46:04.120348 (+     0.000036) mmap(0x200600000, 58720256, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x200600000 <0.001037>
07:46:04.121403 (+     0.001054) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230be40) = 0 <2.639314>
07:46:06.760744 (+     2.639344) ioctl(5, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc7230bd40) = 0 <0.000265>

I found another thread reporting a similar issue, albeit with a different type of delay. What is the status of internal bug #2010268?

Bugs at NVIDIA are always confidential, because many bug reports include proprietary information of one kind or another. Consequently, only the filer and the relevant NVIDIA personnel get to look at bug reports.

Your issue may be identical to an existing bug, or it may be an as-yet unreported issue. The only sure way to find out is to file a bug report yourself. Once NVIDIA engineering have a look at it, they will determine whether it is a duplicate or a new and separate issue.

Here is the output of nvidia-bug-report:

nvidia-bug-report.log.gz (464.2 KB)

Here is the strace output of my CUDA program mcx (GitHub - fangq/mcx: Monte Carlo eXtreme (MCX) - GPU-accelerated photon transport simulator). There are two 2.6 s delays in the execution (run grep '+\s*2\.6' mcx_debug_log.txt to see them), while the actual workload takes only 100 ms.
mcx_debug_log.txt.gz (24.7 KB)

The same latency also happens with OpenCL-based programs. Here is the strace output from running clinfo; there are a total of six 2.6 s delays in the entire call (run grep '+\s*2\.6' log_clinfo.txt to see them).

log_clinfo.txt.gz (131.6 KB)

The only sure way to find out is to file a bug report yourself. Once NVIDIA engineering have a look at it, they will determine whether it is a duplicate or a new and separate issue.

Done, submitted (Bug# 3099084).

Just want to follow up on this issue: I heard back from NVIDIA's engineers that it is now fixed in the new driver release 450.66 (currently in focal-proposed), and I am very glad to confirm that the latency issue is gone.

Both the CUDA and OpenCL codes now start and end without that latency. If you run into the same issue (Threadripper/Ryzen CPU + NVIDIA GPU), wait for this driver!
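If you want to try it before it reaches the regular focal-updates pocket, something along these lines should work; the package name nvidia-driver-450 and the use of the -proposed pocket are my assumptions, so double-check before enabling -proposed on a production node:

$ sudo sh -c 'echo "deb http://archive.ubuntu.com/ubuntu focal-proposed restricted main universe multiverse" > /etc/apt/sources.list.d/focal-proposed.list'
$ sudo apt update
$ sudo apt install -t focal-proposed nvidia-driver-450

Otherwise, a normal apt upgrade should pick up 450.66 once it is promoted to focal-updates.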