CUDA initialization takes a long time, varying up to 30 seconds, on an Amazon p3.16xlarge Windows machine

I’m trying to make use of a p3.16xlarge instance on Amazon AWS. It has 8 Tesla V100 GPUs and runs Windows. I’ve tried both CUDA 10.1 and 10.2 with the corresponding latest Tesla driver versions: 426.23 and 441.22.

Unfortunately, my CUDA application (both the real app and a small test app written just to demonstrate the issue) hangs for up to 30 seconds in the first CUDA call, cudaGetDeviceCount(), as you can see in the call stack below (a minimal timing sketch of the test app follows the stack).

ntdll.dll!NtDeviceIoControlFile()   Unknown
KernelBase.dll!DeviceIoControl()    Unknown
kernel32.dll!DeviceIoControlImplementation()    Unknown
nvcuda.dll!00007fff471030f0()   Unknown
nvcuda.dll!00007fff471501ea()   Unknown
nvcuda.dll!00007fff4715ca07()   Unknown
nvcuda.dll!00007fff46fd2d8d()   Unknown
nvcuda.dll!00007fff46fd365f()   Unknown
nvcuda.dll!00007fff46fd3a1a()   Unknown
nvcuda.dll!00007fff46fd3df5()   Unknown
nvcuda.dll!00007fff46fd3f0d()   Unknown
nvcuda.dll!00007fff4700e436()   Unknown
nvcuda.dll!00007fff46fcbcdb()   Unknown
nvcuda.dll!00007fff46fcc7c2()   Unknown
nvcuda.dll!00007fff46fccd6c()   Unknown
nvcuda.dll!00007fff470280ae()   Unknown
Test1.exe!cudart::globalState::loadDriverInternal(void) C++
Test1.exe!cudart::__loadDriverInternalUtil(void)    C++
Test1.exe!cudart::cuosOnce(unsigned int *,void (*)(void))   C++
Test1.exe!cudart::globalState::loadDriver(void) C++
Test1.exe!cudart::globalState::initializeDriver(void)   C++
Test1.exe!cudaGetDeviceCount()  C++
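For reference, a minimal timing sketch along these lines (my reconstruction; the actual test app may differ) is:

```cpp
// Minimal sketch: time the first CUDA runtime call, which triggers driver initialization.
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    auto t0 = std::chrono::steady_clock::now();

    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);  // first CUDA call: pays the full init cost

    auto t1 = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count();

    printf("cudaGetDeviceCount: %s, %d device(s), %.1f s\n",
           cudaGetErrorString(err), deviceCount, seconds);
    return 0;
}
```

On the p3.16xlarge instance the time printed for this first call is the (up to) 30 seconds described above.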

Can you help with this problem?

A side issue along the way: I was unable to load symbols for nvcuda.dll in the call stack above. If these symbols are published at all, how do I load them?

The total amount of memory (GPU plus system memory) is a significant factor in CUDA initialization time, since all of that memory needs to be mapped into a single unified memory map using operating system calls. I don’t know what the total amount of memory is for this configuration, but given that it sports 8 Tesla V100 GPUs, I am guessing a lot, maybe around 512 GB?

That said, 30 seconds sounds excessive to me.

The RAM is 488 GB. After we reduced the number of GPUs involved to a single one (by setting CUDA_VISIBLE_DEVICES=0; see the sketch below), the initialization time, i.e. the duration of the cudaGetDeviceCount() call, dropped to about 6 seconds. So it seems the mapping is performed in a single CPU thread, for each GPU one by one. At the very least, NVIDIA’s developers could reduce this delay by initializing the GPUs in parallel on different CPU threads.
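For completeness, here is a sketch of how the restriction to one GPU can be applied. We set the variable in the shell before launching, but it can also be set programmatically, as long as that happens before the first CUDA call in the process:

```cpp
// Sketch: make only GPU 0 visible to CUDA, so initialization maps just one device.
// Equivalent to launching the process with CUDA_VISIBLE_DEVICES=0 set in the shell.
#include <cstdlib>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Must run before the first CUDA runtime/driver call in this process.
    _putenv("CUDA_VISIBLE_DEVICES=0");   // Windows CRT; use setenv() on Linux

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);    // now initializes (and maps) only device 0
    printf("visible devices: %d\n", deviceCount);
    return 0;
}
```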

Best I know, the OS calls for mapping memory are effectively non-parallelizable due to the use of a “giant lock”. So it is primarily high single-thread CPU performance that is needed to finish this task quickly, which implies a high CPU clock frequency. My standing recommendation for GPU-accelerated systems is to use CPUs with a base frequency >= 3.5 GHz, if possible.

I am not sure how the 30-second delay is causing issues. The CUDA startup overhead should be amortized over many minutes, even hours, of operation.

You may want to consider filing a feature request with NVIDIA for reduced startup overhead. It is reasonably safe to assume that only items tracked in the bug database are actionable.

Of course, a 30-second delay is a lot. The task itself is small and computes in less than a second, and users expect an immediate response from the program. This issue in CUDA forces us to implement all our programs as long-running servers (sketched below) rather than as simple programs that take input files and produce output files. And that is a huge issue.
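To illustrate the workaround (my own sketch of the design described above, with a hypothetical process_file() standing in for the real per-request work): the process pays the CUDA initialization cost once at startup and then handles requests in a loop.

```cpp
// Sketch of the "server" workaround: initialize CUDA once, then serve requests.
#include <cstdio>
#include <iostream>
#include <string>
#include <cuda_runtime.h>

// Hypothetical per-request GPU work on one input file.
static void process_file(const std::string& path) {
    printf("processing %s on the GPU...\n", path.c_str());
}

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);   // slow, one-time driver initialization
    cudaFree(nullptr);                  // force context creation up front as well

    // A front-end feeds input file paths, one per line; each request is now fast.
    std::string inputPath;
    while (std::getline(std::cin, inputPath)) {
        process_file(inputPath);
    }
    return 0;
}
```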

It seems operating systems are not designed for fast memory mapping; you could complain to Microsoft about that. The unified virtual address space is an important feature of CUDA, so the memory-mapping step is required. I would expect that the best NVIDIA’s engineers could do is go over the necessary sequence of OS API calls with a fine-toothed comb to see whether it can be streamlined. Quite possibly they have done that already, but filing a feature request to make sure wouldn’t hurt.

As I stated in my initial reply, 30 seconds of startup overhead seems high even given the amount of memory in the system. You may want to try a Linux platform to see whether it is faster; I suspect it is (possibly by a factor of two or so), but I do not have any side-by-side data. Choosing a host with fast CPUs and fast system memory should help too; that might be another factor of two. I wonder whether the AWS instances are virtual machines whose hypervisor overhead further slows down the OS operations used in the mapping. You may want to run some experiments on bare metal to assess that.

The reality (for now) is that sub-second CUDA startup overhead is not going to happen with a system this large. As a practical approach, maybe choose the smallest instance you can get away with.