Really slow nvidia-smi, CUDA initialization, or context creation (L40)

We have a system with 8x L40 GPUs on which nvidia-smi, CUDA initialization, context creation, and encoding/decoding are all really slow. Moreover, they become even slower as the load on the server increases.

For example:

Without load (empty GPUs, no processes):
nvidia-smi → 1.2 s
CUDA init → 500 ms

With load (8 video decoding processes with pyav + ffmpeg):
nvidia-smi → 2.8 s
CUDA init → 4 s

If we increase the load further, they become even slower.

We are currently using driver 560.28.03, but we have also tried 550.90.07 and 535.183.06, all with the same results.

Isolating a single GPU (either through CUDA_VISIBLE_DEVICES or by draining the rest of the GPUs) does not have any effect on the latency of the operations above.

I attach the nvidia-bug-report, as well as the results of the following CUDA samples:

  • 1_Utilities/deviceQuery
  • 5_Domain_Specific/p2pBandwidthLatencyTest
  • 1_Utilities/bandwidthTest

devicequery.txt (21.2 KB)
bandwidth.txt (591 Bytes)
p2pbandwidth.txt (8.0 KB)
nvidia-bug-report.log.gz (7.1 MB)

Our system is composed of 2 NUMA nodes, each with 4 GPUs. The output of nvidia-smi topo -mp is:

What could be causing these processes to be so slow?

The program we use to benchmark CUDA initialization is:

#include <cuda.h>   // CUDA driver API (cuInit, CUresult, CUDA_SUCCESS)

#include <chrono>
#include <cstdio>

int main()
{
    // Time only the driver initialization call.
    auto start = std::chrono::high_resolution_clock::now();
    CUresult result = cuInit(0);
    auto stop = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);

    // Cast to long long so the format specifier is portable across platforms.
    printf("Time: %lld ms, Result: %s\n", static_cast<long long>(duration.count()),
           result == CUDA_SUCCESS ? "success" : "failed");

    return 0;
}
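
For reference, it can be built and run like this (assuming the CUDA toolkit is installed; the binary name cuinit_bench is just an example):

nvcc cuinit_bench.cpp -o cuinit_bench -lcuda
./cuinit_bench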

CUDA initialization takes longer with more GPUs exposed to the CUDA runtime, as well as with more system memory in view. This is a CPU-bound process, so loading the CPU with other activity during CUDA initialization will likely slow things down further.

A few general suggestions:

  • enable persistence mode for all GPUs
  • use the CUDA_VISIBLE_DEVICES variable to restrict a process to the GPUs it will actually use.
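
For example (my_app stands in for the actual application, and the device index is just an illustration):

sudo nvidia-smi -pm 1             # enable persistence mode on all GPUs
CUDA_VISIBLE_DEVICES=0 ./my_app   # this process only sees GPU 0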

These suggestions may not help or you may have already done them. I don’t have anything further to suggest.

I am not sure what you are timing in the case of nvidia-smi.

CUDA initialization is pretty much all host-side activity, much of it single threaded. It is not surprising that this takes longer when there is substantial load on the host. I do not know the nature of the “8 video decoding processes”, but presumably they add CPU / system memory load in addition to GPU load.

I have not used an L40, but from what I can see from the data in the TechPowerUp database this appears to be the professional-line equivalent of an RTX 4090, i.e. the top-tier Ada architecture GPU. You need a powerful host to achieve a well-balanced system.

What are the host system specifications?

Thanks @Robert_Crovella and @njuffa for your prompt responses.

@Robert_Crovella we always have persistence mode enabled on our systems.

Also, we already restrict processes with CUDA_VISIBLE_DEVICES (whether running on the host, in Docker, or on k8s). However, everything still seems to get slower as the load on the system increases.

@njuffa we time nvidia-smi to check its loading time. We are aware it is not the best tool for benchmarking, but it allows us to compare across servers.
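
The measurement is nothing more elaborate than timing the command itself, e.g. something like:

time nvidia-smi

just to compare wall-clock time across servers.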

Regarding the CPU, this is what we have for the 8x L40 system:

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         52 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  384
  On-line CPU(s) list:   0-383
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 9654 96-Core Processor
    CPU family:          25
    Model:               17
    Thread(s) per core:  2
    Core(s) per socket:  96
    Socket(s):           2
    Stepping:            1
    Frequency boost:     enabled
    CPU max MHz:         3707.8120
    CPU min MHz:         1500.0000
    BogoMIPS:            4800.17
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3
                         fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp
                         _l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc
                         cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ct
                         rl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization features:
  Virtualization:        AMD-V
Caches (sum of all):
  L1d:                   6 MiB (192 instances)
  L1i:                   6 MiB (192 instances)
  L2:                    192 MiB (192 instances)
  L3:                    768 MiB (24 instances)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-95,192-287
  NUMA node1 CPU(s):     96-191,288-383
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Mitigation; safe RET
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

Info from dmidecode --type system and baseboard attached

baseboard.txt (814 Bytes)
system.txt (2.3 KB)

That would not be my choice for a GPU-accelerated system. In a GPU-accelerated system, the parallel part of the workload should run on the GPU, and the serial portion of the workload remains on the CPU. In order to prevent bottlenecking on the serial portion (something that has been observed in practice and is not just a theoretical concern), a CPU with fast single-thread performance is indicated. To first order, that means high clocks. My standing recommendation is > 3.5 GHz base clock.

The EPYC 9654 clocks are 2.4 GHz base / 3.7 GHz boost, so it is on the lower side. That is not surprising given the high core count of 96 (with 192 threads). You have a dual socket system. I cannot think of many CPU-based workloads that make good use of 192 physical cores (with 384 threads). Maybe you are running such workloads and need all these CPU cores; I obviously don’t know.

In GPU-accelerated systems, as a rule of thumb, 4 CPU cores per GPU are usually sufficient to keep the GPUs well fed. Something like an EPYC 9474F (48 cores, 3.6 GHz base, 4.1 GHz boost) would be a better choice in my view, and one might consider going even smaller like an EPYC 9374F (32 cores, 3.85 GHz base, 4.3 GHz boost).

All these Genoa-class EPYC CPUs have a twelve-channel memory controller, so a dual-socket system provides 24 DDR5 channels in total. You would want to populate all channels, with at least 32GB x 24 = 768 GB installed in the system. Go bigger if you can afford it. As a rule of thumb, system memory should be 2x to 4x the total GPU memory.
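
For this particular system (assuming 48 GB per L40, so the figures below are illustrative):

8 GPUs x 48 GB = 384 GB total GPU memory
2x to 4x of that = 768 GB to 1536 GB of system memory

which is consistent with the 24 x 32 GB minimum above.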

As @Robert_Crovella pointed out, CUDA initialization time increases with the total amount of memory (system memory + GPU memory) in the system. The reason for this is that all memory needs to be mapped into a single unified virtual address space, which involves many OS API calls. For a large system with 8 GPUs, the 500 ms initialization time you have reported for the unloaded system already looks to be on the fast side to me. Using a faster (higher-clocked) CPU should improve this time incrementally.
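
One way to observe this scaling directly is to rerun the cuInit benchmark from above with an increasing number of visible GPUs (a sketch; it assumes the binary is called cuinit_bench and that persistence mode is enabled, so driver load time is not measured as well):

for devs in 0 0,1 0,1,2,3 0,1,2,3,4,5,6,7; do
    echo "CUDA_VISIBLE_DEVICES=$devs"
    CUDA_VISIBLE_DEVICES=$devs ./cuinit_bench
done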


Thank you very much @njuffa! We will try testing our system with a higher-clocked CPU.

Our problem is that whenever the load on the system increases, the initialization time is no longer 500 ms but several seconds.

When we profile cuInit(0) with NVIDIA Nsight Systems, we see a lot of ioctl calls that make up most of the trace in terms of time. What could this mean?

report1.nsys-rep.zip (900.4 KB)
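
The capture was made with a command along these lines (the exact options may have differed slightly; cuinit_bench is the benchmark from above, and the osrt trace is what records OS runtime calls such as ioctl):

nsys profile --trace=cuda,osrt -o report1 ./cuinit_bench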

I understand, and this is expected and not unusual. Multiple processes are competing for host system resources, and this causes a slowdown. In the case of CUDA initialization, much of the activity consists of mapping memory via operating system calls. I am not an expert on this, but my understanding is that, at least for some operating systems, these OS calls often need to grab a global lock. That means only one thread can run after acquiring the lock, and other threads have to wait until the lock is released. That is one of the reasons single-thread CPU performance matters.

That is as I would expect it. Without looking at the details, this likely corresponds to calls into the kernel portion of the NVIDIA driver stack and in turn OS API calls.
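
If you want to corroborate this from the OS side, something like the following (a sketch; it assumes strace is available and reuses the cuinit_bench binary from earlier) counts the system calls issued during initialization and the time spent in them:

strace -f -c -e trace=ioctl,mmap ./cuinit_bench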