64-bit Windows 10, GTX 1060, CUDA kernel startup time?

So, I’ve never really noticed this before, but now whenever I run some CUDA code there’s a hard lag at startup. It seems to happen no matter what my CUDA code is actually doing. I think this may have something to do with a recent Windows 10 update.

I’m really just wondering, has anyone else doing dev on Windows noticed this?

I know the rule of thumb is to always post a sample of working code, but I’m at work and I was testing with my home project. This is a link to the project. Sifting through the CMakeLists.txt files in the debug and release directories shouldn’t be too bad, and they also contain my simple main.cu files.

I profiled my release build, and everything profiles as it did before; it’s just that this weird start-up time shows up no matter what. My GPU is fine, though. I can play Dark Souls 3 and my GPU fans don’t even turn on.

Edit:

I’ll go home tonight and see if I can create the smallest possible replicable example that exhibits the behavior I’m experiencing. I honestly just think it’s any call to CUDA that gives me this weird stumble in performance.

Sorry for the late reply but even this code takes a decent chunk of time to execute:

// main.cu
int main(void)
{
  cudaDeviceSynchronize();
  return 0;
}

The good news is, I feel a lot less bad about my code being slow now.

Compiling and executing this code with nothing but a simple:

nvcc -o demo main.cu

generates an executable which takes around 2.239 seconds for me! I finally found out how to time a command in PowerShell :P

So it seems like any initial call to the CUDA API incurs significant overhead for me.

Edit:

It’s worth noting that I’m compiling this in the Visual Studio developer command prompt and I set the vcvarsall to x86_amd64 because I finally have enough VRAM to break the 4 GB barrier and I’m gonna use it!

CUDA is stateful, and the underlying state must be initialized. This initialization is usually triggered by the first call to any CUDA API function.
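One way to see this for yourself is to trigger the lazy initialization explicitly and time it separately from later calls. A minimal sketch (the cudaFree(0) idiom is a common way to force context creation without doing any real work; the timing numbers will of course vary by system):

```cuda
// init_timing.cu -- sketch: measure first-call (init) cost vs. later calls
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main(void)
{
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);                 // forces driver/context initialization
    auto t1 = std::chrono::steady_clock::now();
    cudaDeviceSynchronize();     // context already exists: should be fast
    auto t2 = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> init = t1 - t0;
    std::chrono::duration<double, std::milli> sync = t2 - t1;
    printf("first call (init): %.1f ms\n", init.count());
    printf("subsequent call:   %.1f ms\n", sync.count());
    return 0;
}
```

If almost all the time lands in the first call, the cost is initialization, not anything your kernels are doing.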

In general, I would expect the initialization time to be in the hundreds of milliseconds, not multiple seconds. Past reports of long initialization times were usually on systems with very large system memory, and/or several GPUs each with large memory. I do not know the details, but part of CUDA initialization maps all device memory and all system memory into a single virtual address space, and the time for this initialization increases with the amount of memory that needs to be mapped. I believe the speed of this process is largely determined by the performance of operating system calls used to apply the mapping(s).

Note that by default, nvcc compiles for an sm_20 target, and JIT compilation overhead will be incurred at CUDA startup if your GPU has a different (higher!) compute capability, which is exceedingly likely at this time. This shouldn’t matter for minimal code like this, but it can bite you with real applications. Compile your code for all relevant compute capabilities, producing a fat binary.
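For example, a fat-binary build line might look something like this (the exact set of architectures worth including depends on the GPUs you care about; the trailing PTX entry provides forward compatibility with newer GPUs at the cost of a JIT pass there):

```
nvcc -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_52,code=sm_52 \
     -gencode arch=compute_61,code=sm_61 \
     -gencode arch=compute_61,code=compute_61 \
     -o demo main.cu
```

With machine code (SASS) embedded for your GPU’s actual architecture, no JIT compilation happens at startup.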

njuffa, you’re always so awesome!

I’ll try cross-compiling my code as well. I do have an Ubuntu partition set up, I just haven’t used it yet because… well, videogames mostly.

If this is partially OS-dependent, then it’s fine for Windows to suck. I’m hoping the CUDA devs at Nvidia will eventually work around this, but until then, this has definitely assuaged my worries.

I feel like GPUs went from relatively little memory to a lot of memory pretty quickly. Like, when I got my 2 GB 750, I was like, whoa. But then all of a sudden, AMD has cards with 8 GB and the 1060 has 6 GB. I’m almost not prepared for this brave new world. I really regret hard-coding int as my default array index type now.

Even for the context creation that does feel slow. I bet njuffa is right and it’s a JIT pass.

But another thought: are you running an Optimus laptop? Perhaps when you start the program the NVidia GPU is turned off entirely and Windows is running only the processor’s IGP. Starting a CUDA kernel would then require powering up the whole NVidia GPU as well, which likely takes more time than mere CUDA runtime initialization.

Try opening a WebGL page to make sure the NVidia GPU is running, leaving it open, then run your program again.

The IGP/NVidia switching is an inelegant pain in Mac OSX laptops like the MacBook Pro… the CUDA device is not even queryable while the IGP is in use, so you have to kickstart CUDA apps by opening an OpenGL window before creating the CUDA context. After the context is created, it works fine.

While Windows 10 may be worse than other operating systems in terms of this memory mapping overhead (possibly due to WDDM 2.0), the issue generally exists on all operating systems supported by CUDA. I seem to recall that multiple previous reports of lengthy initialization time were for example Linux-based servers with on the order of 128 GB of memory and four or eight Tesla GPUs, requiring on the order of 256 GB of total memory to be mapped.

When encountering weird overhead issues on Windows, one should first switch the GPU from the WDDM driver to the TCC driver, if possible (not possible with GTX 1060), and then also try Linux, if possible.

Note that using ‘int’ for array indexing is still able to access 8 GB per array for an array of ‘float’ or ‘int’ elements, so depending on the context of an application that may still be fine, even when using GPUs with large memory. On the other hand, for a general purpose library, where even larger arrays may be encountered, and the size is not under control of the library writer, indexing with ‘long long int’ seems appropriate.
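To make the arithmetic concrete, a grid-stride kernel that indexes with ‘long long int’ might look like the sketch below (the kernel name and sizes are just illustrative; the point is that the index type, not the element type, caps the addressable range at 2^31 elements, i.e., 8 GB for 4-byte elements when using ‘int’):

```cuda
// index_width.cu -- sketch: 64-bit indexing for arrays beyond INT_MAX elements
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, long long n, float s)
{
    // grid-stride loop with 64-bit indexing, so n may exceed 2^31 - 1
    long long stride = (long long)blockDim.x * gridDim.x;
    for (long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
         i < n; i += stride)
        data[i] *= s;
}

int main(void)
{
    long long n = 1LL << 20;   // small here; the index type is the point
    float *d = 0;
    cudaMalloc(&d, n * sizeof(float));
    scale<<<256, 256>>>(d, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```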

As an update, I tried compiling with a specification for the 6.1 architecture and this is my output:

// nvcc -gencode arch=compute_61,code=sm_61 -o demo main.cu
PS C:\Users\christian\cuda\demo> Measure-Command { .\demo.exe }

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 2
Milliseconds      : 217
Ticks             : 22177295
TotalDays         : 2.56681655092593E-05
TotalHours        : 0.000616035972222222
TotalMinutes      : 0.0369621583333333
TotalSeconds      : 2.2177295
TotalMilliseconds : 2217.7295

@SPWorley I am not running on a laptop. Right now I’m running a desktop that I built myself.

Oh snap, I figured it out!

It’s because I’m compiling as a 64 bit application!

njuffa, I think you’re completely correct about the memory mapping thing O_o

If I compile as a 32 bit executable, the code I posted runs in like 184ms compared to the 2s when I compile as a 64 bit application.
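For anyone wanting to reproduce this: nvcc selects the target word size with -m32/-m64 (64-bit being the default on a 64-bit host), so the two builds are just:

```
nvcc -m32 -o demo32 main.cu    rem 32-bit executable, UVA off
nvcc -m64 -o demo64 main.cu    rem 64-bit executable (default), UVA on
```

Note that on Windows the -m32 build needs the 32-bit host toolchain available, so the vcvarsall setting has to match.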

That’s an interesting lead, maybe txbob has some ideas about that. I haven’t built any 32-bit code in years.

64 bit turns on UVA, 32 bit turns it off.

UVA has implications for initialization of the virtual address space.

I think this (initialization) may have gotten slower in Windows 10 as well. Many Win10 bugs are being fixed in the driver; if you’re not using the very latest driver for that GPU, a driver update may be worth a look.

Multiple GPUs in the system will affect this as will overall memory size.

This is all just hand-waving. Without a careful analysis I think any of these and/or JIT could still be a factor.

I think I’m having the same issue but on Windows 7 64-bit.

I’m using a total of six GPUs (two GTX 1070s and four GTX TITAN Xs) and my system has 256 GB of main memory. It takes about 0.8 seconds to create a single context, and if I try to create contexts in parallel on different threads, they seem to get serialized anyway. As a result it takes about 5 seconds to create all six contexts.

I don’t suppose there is any way to disable UVA in 64-bit applications?

I am reasonably sure that UVA is tightly integrated into CUDA and cannot be disabled. I would expect txbob to speak up if I am incorrect in that assessment.

I have never used a Windows system this large, but 0.8 seconds (= 800 milliseconds) for context creation sounds plausible. The overhead is correlated with host system performance: much of the context creation and initialization work presumably consists of non-parallelizable OS calls and therefore correlates with single-thread CPU performance, possibly also with system memory throughput.

That plays into my standing recommendation to use CPUs with modest core counts and very high single-thread performance (> 3.5 GHz base frequency) for systems running GPU-accelerated software.

I recently commented on this general topic (though I was specifically looking at the time to initialize the cuBLAS or cuFFT library).

https://devtalk.nvidia.com/default/topic/1022896/varying-cublas-initialization-times-across-different-os-configurations/#5204766

What I found is that this is not specific to Windows; in fact, Ubuntu 14.04 x64 LTS had a longer initialization time than Windows 8.1 x64. This was for an almost identical system with 64 GB of DDR4 system RAM and two GTX 1080 Ti GPUs, and the Ubuntu system actually has slightly better hardware.

If you are not using any of the CUDA libraries, then plain CUDA initialization is not too long; it is those libraries which really seem to have a long initialization time.

As njuffa mentioned, it also seems to be related to the amount of host system memory, and for Sam that 256 GB is probably the culprit.