Long delays on CUDA app startup causing Nsight Systems to fail on startup

Was the CUDA application running and actively using PCIe transfers? Maybe create a quick ad-hoc app that just copies memory back and forth in a loop. The CPU in your system supplies 64 PCIe4 lanes, so there should definitely be enough bandwidth to hook up two GPUs with a PCIe4 x16 link each.

Any chance these GPUs were installed into the wrong PCIe slots? The system documentation should describe clearly which slots are the high-performance x16 slots suitable for GPUs.

The CUDA app I’ve been using for iterating on this issue only calls cuInit. No other CUDA calls, data transfers, etc.

It’s possible. I won’t be able to check until Monday.

The app should definitely actively use the PCIe link. Write a simple app that allocates a chunk of memory on the host, an identically sized chunk of memory on the device, and then copies back and forth between them in a continuous loop.

If running that does not cause “PCIe gen1” to change to “PCIe gen4” in the “Current” setting, check the PCIe slot assignments and the power cable hookup on the GPUs.
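
If getting physical access to the machine is inconvenient, the link capability and the currently negotiated link state of each GPU can also be read from software. One possible way to do that (the bus IDs below are just examples, substitute the ones reported on your system):

lspci | grep -i nvidia                                # find the GPU bus IDs (e.g. 17:00.0, 65:00.0)
sudo lspci -vv -s 17:00.0 | grep -iE 'lnkcap|lnksta'  # LnkCap = maximum speed/width, LnkSta = currently negotiated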

Here is a bare-bones program that loads up the PCIe link between host and device using bi-directional traffic. On my PCIe gen3 system it reports a combined throughput of 21.2 GB/sec; it should report twice that for a PCIe gen4 system.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE  (128 * 1024 * 1024)
#define MAX_ITER  (10000)
#define DEVICE_ID (0)

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

int main (void)
{
    double start, stop, elapsed;
    cudaStream_t stream[2];
    unsigned char * dbuf[2] = {0, 0};
    unsigned char * hbuf[2] = {0, 0};
    size_t totalSize;
    struct cudaDeviceProp props;

    cudaSetDevice (DEVICE_ID);
    cudaGetDeviceProperties (&props, DEVICE_ID);
    printf ("running on device %d (%s)\n", DEVICE_ID, props.name);
    cudaStreamCreate (&stream[0]);    
    cudaStreamCreate (&stream[1]);
    cudaMalloc ((void**)&dbuf[0], BUF_SIZE);
    cudaMalloc ((void**)&dbuf[1], BUF_SIZE);
    cudaMallocHost ((void**)&hbuf[0], BUF_SIZE);
    cudaMallocHost ((void**)&hbuf[1], BUF_SIZE);
    
    start = second();
    /* issue H2D and D2H copies on separate streams so both PCIe directions are loaded concurrently */
    for (int i = 0; i < MAX_ITER; i++) {
        cudaMemcpyAsync (dbuf[0], hbuf[0], BUF_SIZE,
                         cudaMemcpyHostToDevice, stream [0]);
        cudaMemcpyAsync (hbuf[1], dbuf[1], BUF_SIZE,
                         cudaMemcpyDeviceToHost, stream [1]);
    }
    cudaDeviceSynchronize();
    totalSize = ((size_t)BUF_SIZE) * MAX_ITER * 2;
    stop = second();
    elapsed = stop - start;
    printf ("PCIe throughput (both directions combined): %.2f GB/sec\n", 
            totalSize / elapsed / 1024 / 1024 / 1024);
    cudaFreeHost (hbuf[0]);    
    cudaFreeHost (hbuf[1]);
    cudaFree (dbuf[0]);    
    cudaFree (dbuf[1]);
    cudaStreamDestroy (stream[0]);
    cudaStreamDestroy (stream[1]);
    return EXIT_SUCCESS;
}
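
For reference, it should build and run with nothing more than the following (assuming the source file is saved as nvtest.cu; any name will do):

nvcc -o nvtest nvtest.cu
./nvtest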

I’m unfamiliar with that. (Yes, I can google what it is, I’ve just never used it, nor in my dozen or so years of helping others with CUDA have I come across that as an access method.) There used to be problems on Windows with RDP access to a Windows server for CUDA work. Those were ironed out by NVIDIA a few years ago.

As an experiment, could you try to access via e.g. PuTTY or MobaXterm or another SSH facility?

I’d like you to make clear whether you are using containers, VMs, or any other such method to encapsulate what you are doing. If you are using e.g. a container to do all your work, please try running “bare metal”.

How exactly did you install the CUDA toolkit and the GPU driver? If you look at the Linux install guide I previously linked, you will note that there are two separate installation methods: one I refer to as the runfile install method, and one the package manager install method. Did you install the GPU driver as part of the CUDA toolkit install, or separately?

I think a good experiment would be to shut down X Windows. There isn’t anything wrong with it per se, but we are trying to find a problem. An easy and conclusive way to do this is to use systemctl to set the equivalent of runlevel 3 (they call it the multi-user target) and reboot. Then rerun some tests from the command line / bare metal.
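
For reference, on a systemd-based distribution such as Ubuntu that roughly looks like this (standard systemd commands, but double-check against your distribution’s documentation):

sudo systemctl set-default multi-user.target   # boot to a text console (the old runlevel 3)
sudo reboot
# to go back to the GUI later:
sudo systemctl set-default graphical.target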

I believe the current PCIE generation indication of 1 is simply a power-saving feature. However, the question I have is, were these GPUs installed in the server when it was provided by Supermicro? Or were they installed sometime after delivery of the server?

If you have not already done so, can you please update the Supermicro server to the latest system BIOS available for that unit from Supermicro? Could you please also provide the system model number or the motherboard model number?

I used a terminal to SSH into the server and ran cuInitTest, which only calls cuInit. The results were the same: a 32 second delay on startup with both GPUs enabled.

No containers. No VM. Nothing else.

I used the package manager install as described in the install guide. The GPU driver was installed as part of the toolkit.

I will not be able to get to this experiment right away. Also, since we need the GUI to run the tool, even if it works without X Windows it’s not really a solution for us.

The GPUs were installed either by the system vendor or Supermicro, I’m not sure which. It arrived at our facility with the GPUs installed (we didn’t install them). The system vendor customizes the configuration and resells the Supermicro server.

I won’t be able to get to this anytime soon.

Motherboard model number is X12DPG-QT6
System model number is 740GP-TNRT

From the tests we did last week, the most promising results were setting:
CUDA_VISIBLE_DEVICES=0 <<< 8 second startup of cuInitTest
CUDA_VISIBLE_DEVICES=1 <<< 16 second startup of cuInitTest
unset CUDA_VISIBLE_DEVICES <<< 32 second startup of cuInitTest

cuInitTest only calls cuInit. Nothing else.

If you could run the little “PCIe stress test” program I posted above, that would be helpful for further excluding potential problem sources. While this program is running (a minute or so), nvidia-smi should show “PCIe Generation | Current” as 4.
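
One way to watch this from a second terminal while the test runs (the available field names can be listed with nvidia-smi --help-query-gpu):

nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.width.current --format=csv -l 1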

I ran one of our programs, which transfers 15 GB/sec to GPU 0.

        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x

GPU 1 isn’t being used:

        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 1
                Device Current            : 1
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x

When running on GPU 1, the link changes as expected:

        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x

Output from the sample test program you provided:

sscott@demo:~/esat-rx$ ./nvtest
running on device 0 (NVIDIA RTX A5500)
PCIe throughput (both directions combined): 42.30 GB/sec

The bare-bones program only tests the GPU with the device ID specified in the #define at the top of the code, which is 0 in the version I posted. So the above output from nvidia-smi makes sense and is as expected: under load, the PCIe generation switches to 4.

Throughput of 42.3 GB/sec full-duplex looks as expected.

So at this time it looks like a major component of CUDA initialization time is incurred per GPU added, and that the GPUs themselves are working just fine, including communication with them. I have no idea what could be going on. Why would mapping 24 GB of GPU memory require a full 8 seconds, per the data reported above? I have never used Ubuntu (preferring RHEL for systems that need to work) and I don’t know the details of the memory mapping process.

When I looked through old forum threads where people had reported what looked like excessive CUDA startup overhead, it didn’t look like a particular root cause was ever identified.

I am wondering whether the system is actually busy working on something during the 30 second startup, or whether it is mostly stalled waiting for something (trying to acquire a lock for a shared resource, maybe). If there is no or only minimal CPU load observable during this time, that would indicate the latter.

CPU usage is maxed out while running cuInitTest, which only calls cuInit.

Unfortunately that is inconclusive. The CPU could be busy running some mapping algorithm or busy waiting on a lock. Can we tell whether it is busy in system space or user space? If the former, the time would be attributable to the OS itself and/or NVIDIA’s kernel mode driver; if the latter, presumably to the CUDA driver. What I would expect to see is close to 100% system space activity.
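
A quick first-order check (assuming the test binary is called cuInitTest, as above) would be the shell’s time builtin, which splits the elapsed time into user and system components:

time ./cuInitTest
# "user" = time spent in user space (e.g. the CUDA driver library)
# "sys"  = time spent in the kernel (OS and/or the NVIDIA kernel module)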

This is the program that was running:

#include <cstdio>
#include <chrono>
#include <cuda.h>    // CUDA driver API header (cuInit); link with -lcuda

int main()
{
    auto start = std::chrono::high_resolution_clock::now();
    CUresult result = cuInit(0);
    auto stop = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);

    printf("Time: %ldms, Result: %s\n", duration.count(), result == CUDA_SUCCESS ? "success" : "failed");

    return 0;
}
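
(For completeness: with the toolkit installed it should build with something along these lines; the source file name is my assumption.)

nvcc -o cuInitTest cuInitTest.cpp -lcuda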

The only call of significance here is cuInit. From our strace logs, we know the system is spending a lot of time (24 seconds) in an ioctl call. The system monitor snapshot was taken about 10 seconds after the program started, so it’s pretty safe to assume that it was in the ioctl that takes 24 seconds to complete. So the monitor showing a 100% maxed out virtual core has to be running in kernel space, right? If you have a way for me to gather info to answer your question let me know, but I don’t know how to do it. I’m pretty sure we can infer from the system monitor and our trace log that we are busy in system space, not user space. I have no idea how to figure out whether it’s the OS or the NVIDIA driver.
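
(For reference, per-syscall timing like this can be captured with an strace invocation along these lines, assuming the binary is ./cuInitTest:)

strace -f -T -e trace=ioctl ./cuInitTest
# -T appends the time spent inside each syscall; -f follows child threads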

Sorry, it has been twenty years since I last worked with low-level Linux stuff (or any low-level OS stuff, for that matter). From the information provided it seems your repro scenario is as easy as it gets from a user-land software perspective. And it repros consistently on your platform, which is good. It should be possible to get to the bottom of it; it just needs people with the right knowledge and skill set to do so.

IMHO filing a bug with NVIDIA would be appropriate. Question is: If you were to file a bug with NVIDIA, what are the chances it would repro in-house at NVIDIA? That would need to happen before NVIDIA developers look into the issue. The fact that I have seen posts about excessive CUDA startup overhead only occasionally over the years would seem to indicate that this is a fairly rare occurrence. If you file a bug and NVIDIA cannot repro it on their systems, would you be prepared to ship them your system for inspection? I am mentioning this as a hypothetical scenario; I don’t know what NVIDIA’s process is in such situations.

My understanding based on the information provided in this thread was that no virtualization is in play. I assume by “virtual cores” you mean dual CPUs with 16 physical cores each plus hyperthreading enabled in the SBIOS, resulting in 64 logical cores visible to the operating system.

With the hypothesis that the vast majority of CUDA initialization time is spent mapping memory, I wonder whether fragmentation of the memory space could be playing a role here. One indication of this would be if the startup time is significantly reduced when the CUDA program runs on a freshly booted machine. If you give that a try, it might also be a good time to try the runlevel 3 experiment suggested by @Robert_Crovella.

Correct.

May well be a “red herring”, but I note the cards are RTX/Quadro and the latest Production driver for this class of card is 525.89.02.

Maybe worth trying if all else fails?

No objections to experiments, but downgrading to 525.xx driver will require downgrading from CUDA 12.1 to CUDA 12.0, just as a point of information.

Also, I’d like to advise you of possible support avenues here:

  1. Contact the system vendor for support. It’s possible this is a hardware issue, and furthermore they have their own support path to NVIDIA.

  2. Purchase a support license that entitles you to NVIDIA enterprise support, such as via a license to NVIDIA AI Enterprise. You should be able to purchase this from your system vendor.

  3. File a bug. If you file a bug, you will likely be asked for a set of steps that allow us to reproduce the observation. I don’t know how feasible that is; it may or may not impede progress.

  4. Use forums/community-based support as you are doing here.

In addition to the two requests I have outstanding, I’d like to try to confirm that the nouveau driver is not somehow involved. Could you run the following command:

lsmod | grep nouveau

sscott@demo:~$ lsmod |grep nouveau
sscott@demo:~$