CUDA initialization takes a long time, varying up to 30 seconds, on an Amazon p3.16xlarge Windows machine

I’m trying to make use of a p3.16xlarge instance on Amazon AWS. It has 8 Tesla V100 GPUs and runs Windows. I’ve tried both CUDA 10.1 and 10.2 with the corresponding latest Tesla driver versions: 426.23 and 441.22.

Unfortunately, my CUDA application (both the real app and a small test app written just to demonstrate the issue) hangs for up to 30 seconds in the first CUDA call, cudaGetDeviceCount(), as you can see in the call stack below (a minimal timing sketch of the test app follows the stack).

ntdll.dll!NtDeviceIoControlFile()   Unknown
KernelBase.dll!DeviceIoControl()    Unknown
kernel32.dll!DeviceIoControlImplementation()    Unknown
nvcuda.dll!00007fff471030f0()   Unknown
nvcuda.dll!00007fff471501ea()   Unknown
nvcuda.dll!00007fff4715ca07()   Unknown
nvcuda.dll!00007fff46fd2d8d()   Unknown
nvcuda.dll!00007fff46fd365f()   Unknown
nvcuda.dll!00007fff46fd3a1a()   Unknown
nvcuda.dll!00007fff46fd3df5()   Unknown
nvcuda.dll!00007fff46fd3f0d()   Unknown
nvcuda.dll!00007fff4700e436()   Unknown
nvcuda.dll!00007fff46fcbcdb()   Unknown
nvcuda.dll!00007fff46fcc7c2()   Unknown
nvcuda.dll!00007fff46fccd6c()   Unknown
nvcuda.dll!00007fff470280ae()   Unknown
Test1.exe!cudart::globalState::loadDriverInternal(void) C++
Test1.exe!cudart::__loadDriverInternalUtil(void)    C++
Test1.exe!cudart::cuosOnce(unsigned int *,void (*)(void))   C++
Test1.exe!cudart::globalState::loadDriver(void) C++
Test1.exe!cudart::globalState::initializeDriver(void)   C++
Test1.exe!cudaGetDeviceCount()  C++
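For reference, a minimal timing sketch along these lines (my reconstruction; the actual test app may differ) is:

```cpp
// Minimal sketch: time the first CUDA runtime call, which triggers driver initialization.
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    auto t0 = std::chrono::steady_clock::now();

    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);  // first CUDA call: pays the full init cost

    auto t1 = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count();

    printf("cudaGetDeviceCount: %s, %d device(s), %.1f s\n",
           cudaGetErrorString(err), deviceCount, seconds);
    return 0;
}
```

On the p3.16xlarge instance the time printed for this first call is the (up to) 30 seconds described above.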

Can you help with this problem?

A side issue along the way: I was unable to load symbols for nvcuda.dll in the call stack above. If these symbols are published at all, how do I load them?

The total amount of memory (GPU plus system memory) is a significant factor in CUDA initialization time, since all of that memory needs to be mapped into a single unified memory map using operating system calls. I don’t know what the total amount of memory is for this configuration, but given that it sports 8 Tesla V100 GPUs, I am guessing a lot, maybe around 512 GB?

That said, 30 seconds sounds excessive to me.

The RAM is 488 GB. After we reduced the number of GPUs involved to a single one (by setting CUDA_VISIBLE_DEVICES=0; see the sketch below), the initialization time, i.e. the duration of the cudaGetDeviceCount() call, dropped to about 6 seconds. So it seems the mapping is performed in a single CPU thread, for each GPU one by one. At the very least, NVIDIA’s developers could reduce this delay by initializing the GPUs in parallel on different CPU threads.
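For completeness, here is a sketch of how the restriction to one GPU can be applied. We set the variable in the shell before launching, but it can also be set programmatically, as long as that happens before the first CUDA call in the process:

```cpp
// Sketch: make only GPU 0 visible to CUDA, so initialization maps just one device.
// Equivalent to launching the process with CUDA_VISIBLE_DEVICES=0 set in the shell.
#include <cstdlib>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Must run before the first CUDA runtime/driver call in this process.
    _putenv("CUDA_VISIBLE_DEVICES=0");   // Windows CRT; use setenv() on Linux

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);    // now initializes (and maps) only device 0
    printf("visible devices: %d\n", deviceCount);
    return 0;
}
```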

Best I know, the OS calls for mapping memory are effectively non-parallelizable due to the use of a “giant lock”. So it is primarily high single-thread CPU performance that is needed to finish this task quickly, which implies a high CPU clock frequency. My standing recommendation for GPU-accelerated systems is to use CPUs with a base frequency >= 3.5 GHz, if possible.

I am not sure how the 30-second delay is causing issues. The CUDA startup overhead should be amortized over many minutes, even hours, of operation.

You may want to consider filing a feature request with NVIDIA for reduced startup overhead. It is reasonably safe to assume that only items tracked in the bug database are actionable.

Of course, a 30-second delay is a lot. The task itself is small and computes in less than a second, and users expect an immediate response from the program. This issue in CUDA forces us to implement all our programs as long-running servers (sketched below) rather than as simple programs that take input files and produce output files. And that is a huge issue.
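To illustrate the workaround (my own sketch of the design described above, with a hypothetical process_file() standing in for the real per-request work): the process pays the CUDA initialization cost once at startup and then handles requests in a loop.

```cpp
// Sketch of the "server" workaround: initialize CUDA once, then serve requests.
#include <cstdio>
#include <iostream>
#include <string>
#include <cuda_runtime.h>

// Hypothetical per-request GPU work on one input file.
static void process_file(const std::string& path) {
    printf("processing %s on the GPU...\n", path.c_str());
}

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);   // slow, one-time driver initialization
    cudaFree(nullptr);                  // force context creation up front as well

    // A front-end feeds input file paths, one per line; each request is now fast.
    std::string inputPath;
    while (std::getline(std::cin, inputPath)) {
        process_file(inputPath);
    }
    return 0;
}
```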

It seems operating systems are not designed for fast memory mapping; you could complain to Microsoft about that. The unified virtual address space is an important feature of CUDA, so the memory-mapping step is required. I would expect that the best NVIDIA’s engineers could do is go over the necessary sequence of OS API calls with a fine-toothed comb to see whether it can be streamlined. Quite possibly they have done that already, but filing a feature request to make sure wouldn’t hurt.

As I stated in my initial reply, 30 seconds of startup overhead seems high even given the amount of memory in the system. You may want to try a Linux platform to see whether it is faster; I suspect it is (possibly by a factor of two or so), but I do not have any side-by-side data. Choosing a host with fast CPUs and fast system memory should help too; that might be another factor of two. I wonder whether the AWS instances are virtual machines whose hypervisor overhead further slows down the OS operations used in the mapping. You may want to run some experiments on bare metal to assess that.

The reality (for now) is that sub-second CUDA startup overhead is not going to happen with a system this large. As a practical approach, maybe choose the smallest instance you can get away with.