Very long latency (2.6s+2.6s) when running CUDA/OpenCL code on Threadripper Linux server


I added --persistence-mode and an [Install] section to the nvidia-persistenced service file, following this post.

After a reboot, persistence mode appears to be on for both GPUs, but when running either the CUDA or the OpenCL code, the behavior is almost exactly the same: a 2.6 s delay at the beginning and another 2.6 s delay at the end.

$ sudo systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; disabled; vendor preset: enabled)
     Active: active (running) since Mon 2020-08-17 07:36:56 PDT; 14min ago
    Process: 2018 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --persistence-mode --verbose (code=exited, status=0/SUCCESS)
   Main PID: 2021 (nvidia-persiste)
      Tasks: 1 (limit: 309057)
     Memory: 988.0K
     CGroup: /system.slice/nvidia-persistenced.service
             └─2021 /usr/bin/nvidia-persistenced --user nvidia-persistenced --persistence-mode --verbose

Aug 17 07:36:54 moli nvidia-persistenced[2021]: Now running with user ID 123 and group ID 133
Aug 17 07:36:54 moli nvidia-persistenced[2021]: Started (2021)
Aug 17 07:36:54 moli nvidia-persistenced[2021]: device 0000:01:00.0 - registered
Aug 17 07:36:55 moli nvidia-persistenced[2021]: device 0000:01:00.0 - persistence mode enabled.
Aug 17 07:36:55 moli nvidia-persistenced[2021]: device 0000:01:00.0 - NUMA memory onlined.
Aug 17 07:36:55 moli nvidia-persistenced[2021]: device 0000:21:00.0 - registered
Aug 17 07:36:56 moli nvidia-persistenced[2021]: device 0000:21:00.0 - persistence mode enabled.
Aug 17 07:36:56 moli nvidia-persistenced[2021]: device 0000:21:00.0 - NUMA memory onlined.
Aug 17 07:36:56 moli nvidia-persistenced[2021]: Local RPC services initialized
Aug 17 07:36:56 moli systemd[1]: Started NVIDIA Persistence Daemon.
$ nvidia-smi
Mon Aug 17 07:52:07 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:01:00.0 Off |                  N/A |
|  0%   32C    P8     5W / 250W |     16MiB /  7981MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:21:00.0 Off |                  N/A |
|  0%   29C    P8    19W / 250W |      1MiB /  7982MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2650      G   /usr/lib/xorg/Xorg                            14MiB |
+-----------------------------------------------------------------------------+

Strangely, listing devices via nvidia-smi or my program (mcx -L) prints the GPU info almost instantly, but clinfo was hit by this 2.6 s delay 6 times:

$ grep -B9 '+\s*2\.6' log_clinfo.txt 
07:45:49.590064 (+     0.000050) openat(AT_FDCWD, "/dev/nvidia0", O_RDWR) = 13 <0.000018>
07:45:49.590096 (+     0.000032) fcntl(13, F_SETFD, FD_CLOEXEC) = 0 <0.000004>
07:45:49.590113 (+     0.000016) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4e, 0x38), 0x7ffc7230a9f0) = 0 <0.000292>
07:45:49.590418 (+     0.000305) mmap(0x200400000, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 13, 0) = 0x200400000 <0.000022>
07:45:49.590454 (+     0.000035) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x5e, 0x28), 0x7ffc7230a9c0) = 0 <0.000007>
07:45:49.590473 (+     0.000019) close(13) = 0 <0.000004>
07:45:49.590494 (+     0.000021) ioctl(5, _IOC(_IOC_NONE, 0, 0x49, 0), 0x7ffc7230a8b0) = 0 <0.000006>
07:45:49.590513 (+     0.000018) ioctl(5, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffc7230a060) = 0 <0.000261>
07:45:49.590799 (+     0.000285) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4a, 0xc0), 0x7ffc7230a900) = 0 <2.649303>
07:45:52.240130 (+     2.649337) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc7230a740) = 0 <0.000191>
--
07:45:52.733005 (+     0.000033) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230b9f0) = 0 <0.000033>
07:45:52.733052 (+     0.000047) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230bbb0) = 0 <0.000045>
07:45:52.733110 (+     0.000058) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffc7230b9d0) = 0 <0.000014>
07:45:52.733137 (+     0.000026) munmap(0x7f868a175000, 4096) = 0 <0.000006>
07:45:52.733154 (+     0.000017) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230b9d0) = 0 <0.000028>
07:45:52.733198 (+     0.000044) ioctl(5, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc7230b8b0) = 0 <0.000094>
07:45:52.733304 (+     0.000106) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffc7230b9c0) = 0 <0.000014>
07:45:52.733331 (+     0.000026) mmap(0x200600000, 58720256, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x200600000 <0.000461>
07:45:52.733804 (+     0.000473) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230b9c0) = 0 <2.635809>
07:45:55.369639 (+     2.635838) ioctl(5, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc7230b8c0) = 0 <0.000267>
--
07:45:55.378751 (+     0.000049) openat(AT_FDCWD, "/dev/nvidia1", O_RDWR) = 13 <0.000011>
07:45:55.378776 (+     0.000024) fcntl(13, F_SETFD, FD_CLOEXEC) = 0 <0.000004>
07:45:55.378792 (+     0.000015) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4e, 0x38), 0x7ffc7230a9f0) = 0 <0.000238>
07:45:55.379042 (+     0.000250) mmap(0x200400000, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 13, 0) = 0x200400000 <0.000011>
07:45:55.379069 (+     0.000027) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x5e, 0x28), 0x7ffc7230a9c0) = 0 <0.000005>
07:45:55.379086 (+     0.000017) close(13) = 0 <0.000004>
07:45:55.379104 (+     0.000017) ioctl(5, _IOC(_IOC_NONE, 0, 0x49, 0), 0x7ffc7230a8b0) = 0 <0.000004>
07:45:55.379120 (+     0.000016) ioctl(5, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffc7230a060) = 0 <0.000293>
07:45:55.379435 (+     0.000315) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4a, 0xc0), 0x7ffc7230a900) = 0 <2.644974>
07:45:58.024453 (+     2.645021) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc7230a740) = 0 <0.000126>
--
07:45:58.385133 (+     0.000044) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230b9f0) = 0 <0.000039>
07:45:58.385189 (+     0.000056) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230bbb0) = 0 <0.000056>
07:45:58.385262 (+     0.000072) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffc7230b9d0) = 0 <0.000018>
07:45:58.385295 (+     0.000033) munmap(0x7f868a175000, 4096) = 0 <0.000009>
07:45:58.385318 (+     0.000022) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230b9d0) = 0 <0.000033>
07:45:58.385372 (+     0.000053) ioctl(5, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc7230b8b0) = 0 <0.000102>
07:45:58.385491 (+     0.000118) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffc7230b9c0) = 0 <0.000019>
07:45:58.385525 (+     0.000034) mmap(0x200600000, 58720256, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x200600000 <0.000695>
07:45:58.386236 (+     0.000711) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230b9c0) = 0 <2.644142>
07:46:01.030418 (+     2.644183) ioctl(5, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc7230b8c0) = 0 <0.000258>
--
07:46:01.101095 (+     0.000046) openat(AT_FDCWD, "/dev/nvidia0", O_RDWR) = 13 <0.000015>
07:46:01.101126 (+     0.000031) fcntl(13, F_SETFD, FD_CLOEXEC) = 0 <0.000004>
07:46:01.101144 (+     0.000017) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4e, 0x38), 0x7ffc7230ae70) = 0 <0.000119>
07:46:01.101278 (+     0.000133) mmap(0x200400000, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 13, 0) = 0x200400000 <0.000018>
07:46:01.101311 (+     0.000033) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x5e, 0x28), 0x7ffc7230ae40) = 0 <0.000007>
07:46:01.101332 (+     0.000020) close(13) = 0 <0.000004>
07:46:01.101354 (+     0.000021) ioctl(5, _IOC(_IOC_NONE, 0, 0x49, 0), 0x7ffc7230ad30) = 0 <0.000005>
07:46:01.101374 (+     0.000019) ioctl(5, _IOC(_IOC_NONE, 0, 0x21, 0), 0x7ffc7230a4e0) = 0 <0.000068>
07:46:01.101466 (+     0.000091) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4a, 0xc0), 0x7ffc7230ad80) = 0 <2.647105>
07:46:03.748613 (+     2.647153) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffc7230abc0) = 0 <0.000104>
--
07:46:04.119888 (+     0.000055) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230be70) = 0 <0.000049>
07:46:04.119954 (+     0.000066) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230c030) = 0 <0.000073>
07:46:04.120046 (+     0.000091) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffc7230be50) = 0 <0.000022>
07:46:04.120082 (+     0.000036) munmap(0x7f868a175000, 4096) = 0 <0.000013>
07:46:04.120110 (+     0.000027) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230be50) = 0 <0.000042>
07:46:04.120177 (+     0.000066) ioctl(5, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc7230bd30) = 0 <0.000119>
07:46:04.120312 (+     0.000135) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4f, 0x20), 0x7ffc7230be40) = 0 <0.000021>
07:46:04.120348 (+     0.000036) mmap(0x200600000, 58720256, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, 0, 0) = 0x200600000 <0.001037>
07:46:04.121403 (+     0.001054) ioctl(4, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x29, 0x10), 0x7ffc7230be40) = 0 <2.639314>
07:46:06.760744 (+     2.639344) ioctl(5, _IOC(_IOC_NONE, 0, 0x22, 0), 0x7ffc7230bd40) = 0 <0.000265>
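To tally the slow calls rather than eyeballing the grep output, a small script along these lines could help. It is only a sketch: it assumes the log lines look like the ones above (timestamp, optional relative delta in parentheses, the syscall, and the time spent in angle brackets at the end). The function names slow_calls and summarize and the 1-second threshold are my own choices for illustration, not part of any tool.

```python
import re
import sys
from collections import defaultdict

# One strace line in the format shown above:
#   HH:MM:SS.usec (+ delta) syscall(args) = ret <seconds>
LINE_RE = re.compile(
    r'^(?P<ts>\d{2}:\d{2}:\d{2}\.\d+)\s+'
    r'(?:\(\+\s*[\d.]+\)\s+)?'
    r'(?P<call>\w+)\((?P<args>.*)\)\s+=\s+(?P<ret>\S+).*?'
    r'<(?P<dur>\d+\.\d+)>\s*$'
)
# The decoded ioctl request, e.g. _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x4a, 0xc0)
IOC_RE = re.compile(r'_IOC\([^)]*\)')


def slow_calls(lines, threshold=1.0):
    """Yield (timestamp, syscall, ioctl_request, seconds) for slow calls."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skips grep separators ("--") and unparsable lines
        dur = float(m.group('dur'))
        if dur < threshold:
            continue
        req = IOC_RE.search(m.group('args'))
        yield m.group('ts'), m.group('call'), req.group(0) if req else '', dur


def summarize(lines, threshold=1.0):
    """Group slow calls by (syscall, request): {key: [count, total_seconds]}."""
    totals = defaultdict(lambda: [0, 0.0])
    for _, call, req, dur in slow_calls(lines, threshold):
        entry = totals[(call, req)]
        entry[0] += 1
        entry[1] += dur
    return dict(totals)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python slow_calls.py log_clinfo.txt
    with open(sys.argv[1]) as fh:
        for (call, req), (count, total) in summarize(fh).items():
            print(f"{count:3d} x {call} {req}: {total:.3f} s total")
```

Run against log_clinfo.txt, this should group the six stalls by ioctl request, which makes it easier to see whether the delay always lands on the same requests (0x4a and 0x29 in the excerpt above).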