I recently purchased a computing cluster with 10 nodes (Threadripper 3990X / 4x RTX 2080 Super / Ubuntu 20.04 / 256 GB / TRX40 Creator / CUDA 10.1 / 440.100 driver / 5.4 kernel). Everything seems to work fine, except that my CUDA programs have a rather long delay when initializing the GPU as well as when freeing the device.
I saw a similar report of this happening when using NVIDIA GPUs with a Threadripper Linux server.
Using strace -tt -T -r, I found that the two ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, ...) calls below, which access /dev/nvidia0, added about 2.65 s each, while my actual job only took about 100 ms, so over 90% of the running time was wasted in these two ioctl calls.
The same test took only 0.2 s to finish on another machine with an Intel i7-7700K / Titan V / Ubuntu 16.04.
The same latency happens with OpenCL, but none is seen for the CPU jobs (SSE4), so I am pretty sure it is caused by accessing /dev/nvidia*.
Can someone suggest how to debug this? The system was freshly installed by the vendor recently, with all updates applied.
The log below shows a 2.6 s delay at the beginning of the code, and the bottom block shows the second 2.6 s delay when releasing the device.
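To reproduce this without mcx, a minimal timing harness along the lines of the sketch below (no error checking; the empty kernel is just a stand-in for the real workload) should separate the context init and teardown cost from the kernel itself:

// init_teardown_timing.cu - sketch: time CUDA context creation, a trivial
// kernel, and context teardown separately (build with: nvcc init_teardown_timing.cu)
#include <cstdio>
#include <chrono>

__global__ void dummy() {}   // placeholder for the actual workload

static double secs_since(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);                 // forces context creation - where the first slow ioctl should show up
    printf("context init : %.3f s\n", secs_since(t0));

    t0 = std::chrono::steady_clock::now();
    dummy<<<1, 1>>>();
    cudaDeviceSynchronize();     // the actual GPU work, negligible here
    printf("kernel       : %.3f s\n", secs_since(t0));

    t0 = std::chrono::steady_clock::now();
    cudaDeviceReset();           // tears the context down - where the second slow ioctl should show up
    printf("teardown     : %.3f s\n", secs_since(t0));
    return 0;
}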
Beyond that, CUDA startup includes building a unified virtual address map that incorporates all GPU memory and all system memory. The larger the system memory and the more GPUs in the system, the longer this takes, with almost all of that time spent in single-threaded operating system calls. A high-frequency CPU will help (I recommend > 3.5 GHz base frequency), and high-throughput system memory may help.
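To see which startup phase the time goes into, a quick CUDA driver API sketch along these lines (error checking omitted) can split driver initialization from context creation and teardown; the address-map setup should show up in the context calls:

// startup_split.cu - sketch: time cuInit() separately from context
// creation/destruction (build with: nvcc startup_split.cu -lcuda)
#include <cuda.h>
#include <cstdio>
#include <chrono>

static double secs_since(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    auto t0 = std::chrono::steady_clock::now();
    cuInit(0);                     // driver initialization
    printf("cuInit       : %.3f s\n", secs_since(t0));

    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);

    t0 = std::chrono::steady_clock::now();
    cuCtxCreate(&ctx, 0, dev);     // context creation, including address-space setup
    printf("cuCtxCreate  : %.3f s\n", secs_since(t0));

    t0 = std::chrono::steady_clock::now();
    cuCtxDestroy(ctx);             // context teardown
    printf("cuCtxDestroy : %.3f s\n", secs_since(t0));
    return 0;
}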
For testing purposes, I currently have only 2x 2080S in the box.
I can confirm that nvidia-persistenced is running, and I also enabled persistence mode; see the nvidia-smi output below.
However, I still see the same latency and runtime.
fangq@moli:~/space/git/Project/github/mcx/test$ nvidia-smi
Sun Aug 16 22:51:33 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100 Driver Version: 440.100 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... On | 00000000:01:00.0 Off | N/A |
| 0% 32C P8 5W / 250W | 16MiB / 7981MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... On | 00000000:21:00.0 Off | N/A |
| 0% 29C P8 19W / 250W | 1MiB / 7982MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2627 G /usr/lib/xorg/Xorg 14MiB |
+-----------------------------------------------------------------------------+
fangq@moli:~/space/git/Project/github/mcx/test$ ps aux | grep persis
nvidia-+ 2034 0.0 0.0 5108 1624 ? Ss Aug12 0:00 /usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose
I don’t have any additional insights into this. My system only has 32 GB of system memory and one 8 GB GPU, so I do not know what overhead to expect for building the virtual memory map on a system with 256 GB of system memory. I agree that 2.6 seconds startup overhead seems excessively slow.
Thanks. I manually set the persistence mode to on for both GPUs; unfortunately it did not make a difference.
About 3 years ago, we built a GPU node with 12x 1080 Ti. Initially I had a similar latency issue there as well, but running nvidia-persistenced reduced the latency to almost nothing - strange that it does not seem to make a difference this time.
Is it possible to disable this virtual memory map creation, or at least skip the host memory part?
There should definitely be a difference between running with --no-persistence-mode versus running with --persistence-mode. It seems to me you would want to double-check the persistence daemon status. Maybe monitor it with watch -n 2 systemctl status nvidia-persistenced?
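If you want a check that does not depend on systemd at all, persistence mode can also be queried directly through NVML; a small sketch along these lines (assuming the NVML header that ships with the CUDA toolkit; build with gcc check_persistence.c -lnvidia-ml) should report what the driver itself thinks:

// check_persistence.c - sketch: ask the driver whether persistence mode
// is enabled on each GPU
#include <nvml.h>
#include <stdio.h>

int main(void) {
    unsigned int i, count;
    if (nvmlInit() != NVML_SUCCESS) return 1;
    nvmlDeviceGetCount(&count);
    for (i = 0; i < count; i++) {
        nvmlDevice_t dev;
        nvmlEnableState_t mode;
        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetPersistenceMode(dev, &mode);
        printf("GPU %u: persistence mode %s\n", i,
               mode == NVML_FEATURE_ENABLED ? "enabled" : "disabled");
    }
    nvmlShutdown();
    return 0;
}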
The unified virtual memory space has been an essential feature of CUDA for many years, so creating the required memory map is mandatory. It shouldn’t take 2+ seconds, though.
I added --persistence-mode and an [Install] section to the nvidia-persistenced unit file, following this post.
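For reference, the edited unit ended up looking roughly like this (the Description, path, and ExecStart line match the systemctl output below; the Type= and WantedBy= lines are my recollection of Ubuntu's stock unit, so treat them as approximate):

# /lib/systemd/system/nvidia-persistenced.service (edited)
[Unit]
Description=NVIDIA Persistence Daemon

[Service]
Type=forking
ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --persistence-mode --verbose

[Install]
WantedBy=multi-user.target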
After a reboot, persistence mode appears to be on for both GPUs, but when running either the CUDA or the OpenCL code the behavior is almost exactly the same - a 2.6 s delay at the beginning and another 2.6 s at the end.
$ sudo systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; disabled; vendor preset: enabled)
Active: active (running) since Mon 2020-08-17 07:36:56 PDT; 14min ago
Process: 2018 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --persistence-mode --verbose (code=exited, status=0/SUCCESS)
Main PID: 2021 (nvidia-persiste)
Tasks: 1 (limit: 309057)
Memory: 988.0K
CGroup: /system.slice/nvidia-persistenced.service
└─2021 /usr/bin/nvidia-persistenced --user nvidia-persistenced --persistence-mode --verbose
Aug 17 07:36:54 moli nvidia-persistenced[2021]: Now running with user ID 123 and group ID 133
Aug 17 07:36:54 moli nvidia-persistenced[2021]: Started (2021)
Aug 17 07:36:54 moli nvidia-persistenced[2021]: device 0000:01:00.0 - registered
Aug 17 07:36:55 moli nvidia-persistenced[2021]: device 0000:01:00.0 - persistence mode enabled.
Aug 17 07:36:55 moli nvidia-persistenced[2021]: device 0000:01:00.0 - NUMA memory onlined.
Aug 17 07:36:55 moli nvidia-persistenced[2021]: device 0000:21:00.0 - registered
Aug 17 07:36:56 moli nvidia-persistenced[2021]: device 0000:21:00.0 - persistence mode enabled.
Aug 17 07:36:56 moli nvidia-persistenced[2021]: device 0000:21:00.0 - NUMA memory onlined.
Aug 17 07:36:56 moli nvidia-persistenced[2021]: Local RPC services initialized
Aug 17 07:36:56 moli systemd[1]: Started NVIDIA Persistence Daemon.
$ nvidia-smi
Mon Aug 17 07:52:07 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100 Driver Version: 440.100 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... On | 00000000:01:00.0 Off | N/A |
| 0% 32C P8 5W / 250W | 16MiB / 7981MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... On | 00000000:21:00.0 Off | N/A |
| 0% 29C P8 19W / 250W | 1MiB / 7982MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2650 G /usr/lib/xorg/Xorg 14MiB |
+-----------------------------------------------------------------------------+
Strangely, listing devices via nvidia-smi or via my program (mcx -L) printed the GPU info almost instantly, but clinfo was hit by this 2.6 s delay six times:
Bugs at NVIDIA are always confidential, because many bug reports include proprietary information of one kind or another. Consequently, only the filer and the relevant NVIDIA personnel get to look at bug reports.
Your issue may be identical to an existing bug, or it may be an as-yet unreported issue. The only sure way to find out is to file a bug report yourself. Once NVIDIA engineering has a look at it, they will determine whether it is a duplicate or a new and separate issue.
The same latency also happens with OpenCL-based programs. Here is the strace output from running clinfo; there are a total of six 2.6 s delays in the entire run (run grep '+\s*2\.6' log_clinfo.txt to see them).
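Since clinfo queries a lot of properties, a bare-bones OpenCL host program along the lines of the sketch below (no error checking; just enumeration plus one context create/release) should narrow down which calls actually eat the 2.6 s:

// cl_enum_timing.c - sketch: time OpenCL platform/device enumeration and a
// single context create/release (build with: gcc cl_enum_timing.c -lOpenCL)
#include <CL/cl.h>
#include <stdio.h>
#include <time.h>

static double now_s(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    cl_platform_id plat;
    cl_device_id dev;
    cl_uint n;
    cl_int err;
    double t;

    t = now_s();
    clGetPlatformIDs(1, &plat, &n);
    printf("clGetPlatformIDs : %.3f s\n", now_s() - t);

    t = now_s();
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, &n);
    printf("clGetDeviceIDs   : %.3f s\n", now_s() - t);

    t = now_s();
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    printf("clCreateContext  : %.3f s\n", now_s() - t);

    t = now_s();
    clReleaseContext(ctx);
    printf("clReleaseContext : %.3f s\n", now_s() - t);
    return 0;
}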
Just want to follow up on this issue - I heard back from an NVIDIA engineer that it is fixed in the new 450.66 driver release (currently in focal-proposed), and I am very glad to confirm that the latency issue is gone.
Both the CUDA and OpenCL codes now start and end without that latency. If you ran into the same issue (Threadripper/Ryzen CPU + NVIDIA GPU), wait for this driver!