Long delays on CUDA app startup causing Nsight Systems to fail on startup

Ubuntu 22.04 Server with Desktop installed
CUDA 12.1

I have been iterating with @liuyis in the Nsight Systems group regarding a problem with Nsight Systems nsys daemon startup.

After several experiments, it seems the failure happens because the call to cuInit takes too long and nsys times out, even though nvidia-persistenced is running.

Running strace while starting up a CUDA application revealed the system calls causing the long delay:

Thanks for sharing the result. I'm not seeing a problematic GPU device in the log; however, I do see the system calls that caused the long delay:

ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffe4da090b0) = 0 <8.005325>
...
ioctl(4, _IOC(_IOC_NONE, 0, 0x25, 0), 0x7ffe4da0c350) = 0 <24.041903>

and tracking back I can see that the FDs involved were /dev/nvidiactl and /dev/nvidia-uvm:

openat(AT_FDCWD, "/dev/nvidiactl", O_RDWR) = 3 <0.000019>
...
openat(AT_FDCWD, "/dev/nvidia-uvm", O_RDWR|O_CLOEXEC) = 4 <0.000020>

Unfortunately this is out of the scope of Nsight Systems development; I suggest reporting the issue to the CUDA - NVIDIA Developer Forums.

Is there someone here who can pick up this thread and help us resolve the issue? We also notice that all our CUDA applications take about 30 seconds to start up, which is driving us crazy…

Thanks.
strace.txt (39.8 KB)

That's not typical in my experience. For example, if you purchased a Dell workstation with an NVIDIA GPU in it, loaded Ubuntu 22.04 Linux on it, installed CUDA following our Linux install guide, and then compiled and ran the sample codes, I'm quite confident you wouldn't experience that.

So I suspect there is something about the environment that you are running in that is contributing to this observation.

I suggest we start there. Please describe your environment, both hardware and software.

Let me know what additional info you need.

Hardware:
Supermicro 4U Tower GPU Server
2 x Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz (16 Cores/32 Hyperthreads each)
256GB DDR4 3200MHz (16 x 16GB Sticks)
2 x PNY NVIDIA RTX A5500 PRO 24GB GDDR6 PCIe 4.0 x16 ECC Graphics Cards
2 x 800GB NVMe (Mirror OS)
2 x 1.92TB NVMe (Mirror Scratch)
1 X PNY NVIDIA NVLink 2-Slot for A Series Product
Dual hot swap power supplies
Total Eight 3.5 Inch Hot-Swap Drive Bays (4 used, 4 empty)
Integrated IPMI
Onboard VGA Graphics (no built-in controller in Xeon 6326)

Software
Came with Ubuntu 22.04 Server installed (no desktop GUI). Installed Ubuntu Desktop (GDM3 launcher).

Kernel

sscott@demo:~/esat-rx$ uname -r
5.15.0-67-generic

Ubuntu

sscott@demo:~/esat-rx$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.1 LTS
Release:	22.04
Codename:	jammy

CUDA

sscott@demo:~/esat-rx$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0

The lspci output is very long; see the attached lspci.out.txt file.
lspci.out.txt (18.0 KB)

NVIDIA Driver

sscott@demo:~/esat-rx$ nvidia-smi
Fri Mar 17 17:13:13 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A5500                On | 00000000:31:00.0 Off |                  Off |
| 30%   32C    P8               17W / 230W|      6MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5500                On | 00000000:4B:00.0 Off |                  Off |
| 30%   32C    P8               18W / 230W|      6MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      7007      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      7007      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

Nothing in the system configuration suggests a reason for the lengthy CUDA startup to me. Maybe @Robert_Crovella can spot something worth a closer look. As a 20+ year purchaser of Dell products I do not have hands-on experience with Supermicro platforms, but Supermicro has had an excellent reputation for high-performance systems for some 25 years, making it unlikely that anything is fundamentally wrong with the hardware platform.

Just to clarify: When you run a fairly minimal CUDA program, do you also observe this 30-second delay? For example, the simpleCUBLAS sample app included in NVIDIA's collection of sample apps.

Is there any kind of extensive logging or monitoring of OS-level activity enabled on this system when the CUDA startup delay is measured? Is any kind of virtualization in effect? Is any kind of disk or memory compression turned on, or a security feature like a real-time malware scanner?

Are the measurements of CUDA application execution time performed on a system with minimal system load? Ideally one would establish a baseline on a freshly booted system where the CUDA app is the only user-land app running.

The startup cost of CUDA is largely in the creation of the CUDA context, which correlates significantly with the amount of memory in the system and somewhat with the number of GPUs. Context creation happens lazily and is typically triggered by the first CUDA API call. An easy way to isolate the startup cost is to invoke cudaFree(0); early in the program and time that API call with a high-resolution timer like gettimeofday(). Based on the description of the system configuration, I would expect it to take less than 2 seconds on this system.
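
For illustration, a minimal sketch of that measurement (assuming the CUDA runtime API and gettimeofday(); file and variable names are made up) might look like:

// time_init.cu -- measure CUDA context creation cost by timing cudaFree(0)
// Build, e.g.: nvcc -arch=sm_86 -o time_init time_init.cu
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

static double wallclock_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

int main(void)
{
    double start = wallclock_ms();
    cudaError_t err = cudaFree(0);   // first CUDA call; triggers lazy context creation
    double stop = wallclock_ms();
    printf("cudaFree(0) took %.1f ms (%s)\n", stop - start, cudaGetErrorString(err));
    return 0;
}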

Just to clarify: When you run a fairly minimal CUDA program, do you also observe this 30-second delay?

Yes. Even this simple app, which just calls cuInit, has this problem.
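
(For reference, a test along these lines might look like the sketch below. This is not the actual cuInitTest.cu source, which is not posted in this thread, just an illustration of timing cuInit via the driver API.)

// cuinit_time.cu -- hypothetical sketch: time the cuInit() call (driver API)
// Build, e.g.: nvcc -o cuinit_time cuinit_time.cu -lcuda
#include <cstdio>
#include <chrono>
#include <cuda.h>

int main(void)
{
    auto start = std::chrono::steady_clock::now();
    CUresult res = cuInit(0);        // essentially all of the startup delay is spent here
    auto stop  = std::chrono::steady_clock::now();
    long long ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    printf("Time: %lldms, Result: %s\n", ms, res == CUDA_SUCCESS ? "success" : "failure");
    return 0;
}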

Is there any kind of extensive logging or monitoring of OS-level activity enabled on this system when the CUDA startup delay is measured? Is any kind of virtualization in effect? Is any kind of disk or memory compression turned on, or a security feature like a real-time malware scanner?

No logging, monitoring, compression, or scanning that I am aware of. The server came with a minimal, standard, native Ubuntu Server 22.04 install, no desktop GUI; I have installed everything else on this machine. It is running in a software RAID configuration (it came that way), and we noticed some slowness in our disk accesses, which we assumed were due to the software RAID (mdraid). Not sure if that affects GPU access (disks and GPUs are on the PCIe bus).

We are accessing the server via RDP, so xrdp is running. There is a VGA output on the system, and I think a Supermicro-added built-in graphics controller owns it (the Xeon processor doesn't have built-in graphics, so I assume the VGA controller is something Supermicro added along with their IPMI system). The X Window System seems to be aware of the GPUs; I'm not sure whether it's interfering with our CUDA development or not.

From the strace log, two calls are taking all the time:

ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x46, 0x2a, 0x20), 0x7ffe4da090b0) = 0 <8.005325>
...
ioctl(4, _IOC(_IOC_NONE, 0, 0x25, 0), 0x7ffe4da0c350) = 0 <24.041903>

and tracking back I can see that the FDs involved were /dev/nvidiactl and /dev/nvidia-uvm:

openat(AT_FDCWD, "/dev/nvidiactl", O_RDWR) = 3 <0.000019>
...
openat(AT_FDCWD, "/dev/nvidia-uvm", O_RDWR|O_CLOEXEC) = 4 <0.000020>

Other calls seem normal.

Are the measurements of CUDA application execution time performed on a system with minimal system load? Ideally one would establish a baseline on a freshly booted system where the CUDA app is the only user-land app running.

Well, we are trying to run the Nsight Systems GUI profiler, which requires a desktop, against a local GPU, and the profiler is failing to connect to the daemon because of the long startup delay. The system is running the Ubuntu desktop, but when I monitor system load, most of the 64 "CPUs" are idle. Other than the desktop and its associated window system, the server isn't running anything except the CUDA applications and associated tools.

This System Monitor snapshot is typical of system load when we are trying to run/develop CUDA apps.

I forgot to bring up one of the obvious reasons that could cause lengthy startup for a CUDA-accelerated application: JIT compilation overhead. If a large amount of code needs to be JIT-compiled, that could easily take 30 seconds.

JIT compilation may happen on purpose, such as when (possibly dynamically generated) PTX code is JIT-compiled, or inadvertently, when the wrong target architecture is specified during compilation or when only PTX but no SASS (machine code) is embedded in the fat binary. The RTX A5500 has compute capability 8.6, so you would want to specify the target architecture sm_86 to nvcc.

Just to confirm: The delay is observable when strace and the CUDA tools (debugger, profiler) are not in play, just by measuring elapsed time with gettimeofday()?

Yes.

With the persistence daemon up and running, this should not matter. A quick cross-check would be to see whether the delay changes when the software RAID is turned off. I am not sure how involved such an experiment would be, as I usually operate workstations without RAID, not servers.

I cannot think of anything else right now. I don't know how one would gain visibility into what is happening under the hood of those ioctl calls. From the strace output it is clear that the majority of the delay is incurred mapping memory into the unified virtual memory map. Based on the total amount of memory in the system (system memory plus GPU memory), though, that should not take 24 seconds on a mostly idle system with a relatively fast CPU.

If someone has access to a somewhat similarly configured system (I do not), it would be interesting to see what kind of UVM initialization time they observe. Maybe I was mistaken with my estimate of 2 seconds.

Is this command line correct/sufficient:

sscott@demo:~/esat-rx$ make cuInitTest
nvcc -gencode arch=compute_86,code=sm_86 -l cuda -o cuInitTest cuInitTest.cu
sscott@demo:~/esat-rx$ ./cuInitTest 
Time: 32083ms, Result: success

The compilation looks fine.

Right now it would be helpful to calibrate our expectations as to how long memory mapping should take on a system with this much memory. I recall that we had previous reports of massive initialization overhead for large-memory systems in this forum, but I no longer recall the details of those cases. On my Windows workstation here, with 32 GB of system memory plus 13 GB of GPU memory across two GPUs, CUDA initialization takes about 0.2 seconds.

One quick experiment would be to reduce the number of GPUs visible to CUDA to just one via the environment variable CUDA_VISIBLE_DEVICES to see whether that has any impact on the startup cost.

How to set this?

Oh wow, that helped!
sscott@demo:~/esat-rx$ ./cuInitTest
Time: 8041ms, Result: success

I found one of the older threads I mentioned, and the user there reported startup cost of 30 seconds for a system with 488 GB of system memory and 8 V100 GPUs (so presumably 8 x 16 GB = 128 GB of GPU memory).

The cost of memory mapping is largely attributable to OS API calls. I do not recall details, and do not know whether there are system settings that could impact the speed of those.

sscott@demo:~/esat-rx$ ./cuInitTest
Time: 8041ms, Result: success

So this is the effect of turning off one of the two GPUs in your system? If so, I have no explanation for the magnitude of the reduction (32 sec → 8 sec).

Yes.

I did this in .bashrc:
export CUDA_VISIBLE_DEVICES=0
But the Nsight Systems application still fails to connect to the daemon. It seems like it is still hitting the 30-second timeout. My simple application, however, loads much faster.

I tried launching the profiler via a launch icon on the desktop and from a terminal command line, but it still has the long startup time and fails.

Sorry, I do not know what is going on. Perusing various older threads on CUDA initialization delay did not provide any insights beyond what we have discussed here. In one of the threads I found this:

Just tested on a Windows 10 x64 system with 128 GB DDR4 RAM and 2 GTX 1080TI GPUs and the first CUDA initialization time took about 570 ms.

This is for a system with roughly half the memory of the one discussed here, making me feel a bit more confident in the 2 second estimate I gave above, although additional hard data for a system substantially similar to the one discussed here would still be useful.

Yes, something odd is happening. If I disable device 1, load time is 8 seconds. If I disable device 0, load time is 16 seconds. With both devices enabled, load time is 32 seconds.

sscott@demo:~/esat-rx$ unset CUDA_VISIBLE_DEVICES
sscott@demo:~/esat-rx$ ./cuInitTest 
Time: 32089ms, Result: success
sscott@demo:~/esat-rx$ export CUDA_VISIBLE_DEVICES=1
sscott@demo:~/esat-rx$ ./cuInitTest 
Time: 16044ms, Result: success
sscott@demo:~/esat-rx$ export CUDA_VISIBLE_DEVICES=0
sscott@demo:~/esat-rx$ ./cuInitTest 
Time: 8041ms, Result: success

It appears that the Nsight Systems GUI (nsys-gui) is ignoring the CUDA_VISIBLE_DEVICES set in my local environment when it launches the nsys daemon…

I would suggest performing yet another sanity check: Are both GPUs communicating with the system via a PCIe gen4 x16 link?

PCIe links are dynamically configured to reduce power draw and energy consumption. Therefore one must observe the link configuration while a CUDA app is actively using the GPUs. Run nvidia-smi -q and look for the sections that look like this:

       GPU Link Info
           PCIe Generation
               Max                 : 4
               Current             : 4
           Link Width
               Max                 : 16x
               Current             : 16x

Both GPUs show:

        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 1
                Device Current            : 1
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x

This doesn't look right.

Was the CUDA application running and actively using PCIe transfers? Maybe create an ad-hoc app that just copies memory back and forth between host and device in a loop. The CPU in your system supplies 64 PCIe4 lanes, so there should definitely be enough bandwidth to hook up two GPUs with a PCIe4 x16 link each.
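
For instance, a throwaway copy loop like the sketch below (transfer size and iteration count are arbitrary) should keep the link busy long enough to inspect nvidia-smi -q from a second terminal:

// pcie_busy.cu -- keep the PCIe link active with host<->device copies
// so the link generation/width can be observed under load (illustrative sketch)
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 256u << 20;   // 256 MiB per transfer (arbitrary)
    void *h = malloc(bytes);
    void *d = NULL;
    if (!h || cudaMalloc(&d, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    for (int i = 0; i < 200; ++i) {    // run long enough to check nvidia-smi -q meanwhile
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d);
    free(h);
    return 0;
}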

Any chance these GPUs were installed into the wrong PCIe slots? The system documentation should describe clearly which slots are the high-performance x16 slots suitable for GPUs.

Another possibility: If a GPU is supplied with insufficient power, it may fall back to basic functionality only. With a few exceptions, NVIDIA’s GPUs draw at most 40W through the PCIe slot (the spec allows up to 75W), the balance being provided by PCIe auxiliary power connectors.

Per its datasheet, the RTX A5500 draws up to 230W, but sports a single 8-pin PCIe auxiliary power connector, which can deliver 150W per PCIe specification. This means that the card draws the maximum allowed 75W through the PCIe slot. I know this sums to just 225W. Presumably there is either a documentation issue or the card is actually designed to draw a smidgen more than the spec limit through the auxiliary power connector (a small excess there is not critical; excess at the slot connector is).

Make sure the 8-pin auxiliary power cables are hooked up and the connectors fully engaged (typically a tab snaps into a small groove). Regarding the power cables, there should be no Y-splitters, pin-count converters, or daisy-chaining involved.