64-bit Windows 10, GTX 1060, CUDA kernel startup time?

So, I’ve never really noticed this before, but now whenever I run some CUDA code there’s a hard lag at startup. It seems to happen no matter what my CUDA code is actually doing. I think this may have something to do with a recent Windows 10 update.

I’m really just wondering, has anyone else doing dev on Windows noticed this?

I know the rule of thumb is to always post a sample of working code, but I’m at work and I was testing with my home project. This is a link to the project. Sifting through the CMakeLists.txt files in the debug and release directories shouldn’t be too bad, and they also contain my simple main.cu files.

I profiled my release build, and everything profiles as it did before; it’s just that this weird start-up time shows up no matter what. My GPU is fine, though. I can play Dark Souls 3 and my GPU fans don’t even turn on.

Edit:

I’ll go home tonight and see if I can create the smallest possible replicable example that exhibits the behavior I’m experiencing. I honestly just think it’s any call to CUDA that gives me this weird stumble in performance.

Sorry for the late reply but even this code takes a decent chunk of time to execute:

// main.cu
int main(void)
{
  cudaDeviceSynchronize();
  return 0;
}

The good news is, I feel a lot less bad about my code being slow now.

Compiling and executing this code with nothing but a simple:

nvcc -o demo main.cu

generates an executable which takes around 2.239 seconds for me! I finally found out how to time a command in PowerShell :P

So it seems like any initial call to the CUDA API incurs significant overhead for me.

Edit:

It’s worth noting that I’m compiling this in the Visual Studio developer command prompt and I set the vcvarsall to x86_amd64 because I finally have enough VRAM to break the 4 GB barrier and I’m gonna use it!

CUDA is stateful, and the underlying state must be initialized. This initialization is usually triggered by the first call to any CUDA API function.
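One way to see this for yourself is to trigger the lazy initialization explicitly and time it separately from later calls. A minimal sketch (the cudaFree(0) idiom is a common way to force context creation without doing any real work; the timing numbers will of course vary by system):

```cuda
// init_timing.cu -- sketch: measure first-call (init) cost vs. later calls
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main(void)
{
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);                 // forces driver/context initialization
    auto t1 = std::chrono::steady_clock::now();
    cudaDeviceSynchronize();     // context already exists: should be fast
    auto t2 = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> init = t1 - t0;
    std::chrono::duration<double, std::milli> sync = t2 - t1;
    printf("first call (init): %.1f ms\n", init.count());
    printf("subsequent call:   %.1f ms\n", sync.count());
    return 0;
}
```

If almost all the time lands in the first call, the cost is initialization, not anything your kernels are doing.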

In general, I would expect the initialization time to be in the hundreds of milliseconds, not multiple seconds. Past reports of long initialization times were usually on systems with very large system memory, and/or several GPUs each with large memory. I do not know the details, but part of CUDA initialization maps all device memory and all system memory into a single virtual address space, and the time for this initialization increases with the amount of memory that needs to be mapped. I believe the speed of this process is largely determined by the performance of operating system calls used to apply the mapping(s).

Note that by default, nvcc compiles for an sm_20 target, and JIT compilation overhead will be incurred at CUDA startup if your GPU has a different (higher!) compute capability, which is exceedingly likely at this time. This shouldn’t matter for minimal code like this, but it can bite you with real applications. Compile your code for all relevant compute capabilities, producing a fat binary.
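For example, a fat-binary build line might look something like this (the exact set of architectures worth including depends on the GPUs you care about; the trailing PTX entry provides forward compatibility with newer GPUs at the cost of a JIT pass there):

```
nvcc -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_52,code=sm_52 \
     -gencode arch=compute_61,code=sm_61 \
     -gencode arch=compute_61,code=compute_61 \
     -o demo main.cu
```

With machine code (SASS) embedded for your GPU’s actual architecture, no JIT compilation happens at startup.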

njuffa, you’re always so awesome!

I’ll try cross-compiling my code as well. I do have an Ubuntu partition set up, I just haven’t used it yet because… well, videogames mostly.

If this is partially OS-dependent, then it’s fine for Windows to suck. I’m hoping the CUDA devs at Nvidia will eventually work around this, but until then, this has definitely assuaged my worries.

I feel like GPUs went from relatively little memory to a lot of memory pretty quickly. Like, when I got my 2 GB 750, I was like, whoa. But then all of a sudden, AMD has cards with 8 GB and the 1060 has 6 GB. I’m almost not prepared for this brave new world. I really regret hard-coding int as my default array index type now.

Even for the context creation that does feel slow. I bet njuffa is right and it’s a JIT pass.

But another thought: are you running an Optimus laptop? Perhaps when you start the program the NVidia GPU is turned off entirely and Windows is running only the processor’s IGP. Starting a CUDA kernel would then require powering up the whole NVidia GPU as well, which likely takes more time than mere CUDA runtime initialization.

Try opening a WebGL page to make sure the NVidia GPU is running, leaving it open, then run your program again.

The IGP/NVidia switching is an inelegant pain in Mac OSX laptops like the MacBook Pro… the CUDA device is not even queryable while the IGP is in use, so you have to kickstart CUDA apps by opening an OpenGL window before creating the CUDA context. After the context is created, it works fine.

While Windows 10 may be worse than other operating systems in terms of this memory mapping overhead (possibly due to WDDM 2.0), the issue generally exists on all operating systems supported by CUDA. I seem to recall that multiple previous reports of lengthy initialization time were for example Linux-based servers with on the order of 128 GB of memory and four or eight Tesla GPUs, requiring on the order of 256 GB of total memory to be mapped.

When encountering weird overhead issues on Windows, one should first switch the GPU from the WDDM driver to the TCC driver, if possible (not possible with GTX 1060), and then also try Linux, if possible.

Note that using ‘int’ for array indexing is still able to access 8 GB per array for an array of ‘float’ or ‘int’ elements, so depending on the context of an application that may still be fine, even when using GPUs with large memory. On the other hand, for a general purpose library, where even larger arrays may be encountered, and the size is not under control of the library writer, indexing with ‘long long int’ seems appropriate.
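To make the arithmetic concrete, a grid-stride kernel that indexes with ‘long long int’ might look like the sketch below (the kernel name and sizes are just illustrative; the point is that the index type, not the element type, caps the addressable range at 2^31 elements, i.e., 8 GB for 4-byte elements when using ‘int’):

```cuda
// index_width.cu -- sketch: 64-bit indexing for arrays beyond INT_MAX elements
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, long long n, float s)
{
    // grid-stride loop with 64-bit indexing, so n may exceed 2^31 - 1
    long long stride = (long long)blockDim.x * gridDim.x;
    for (long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
         i < n; i += stride)
        data[i] *= s;
}

int main(void)
{
    long long n = 1LL << 20;   // small here; the index type is the point
    float *d = 0;
    cudaMalloc(&d, n * sizeof(float));
    scale<<<256, 256>>>(d, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```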

As an update, I tried compiling with a specification for the 6.1 architecture and this is my output:

// nvcc -gencode arch=compute_61,code=sm_61 -o demo main.cu
PS C:\Users\christian\cuda\demo> Measure-Command { .\demo.exe }

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 2
Milliseconds      : 217
Ticks             : 22177295
TotalDays         : 2.56681655092593E-05
TotalHours        : 0.000616035972222222
TotalMinutes      : 0.0369621583333333
TotalSeconds      : 2.2177295
TotalMilliseconds : 2217.7295

@SPWorley I am not running on a laptop. Right now I’m running a desktop that I built myself.

Oh snap, I figured it out!

It’s because I’m compiling as a 64 bit application!

njuffa, I think you’re completely correct about the memory mapping thing O_o

If I compile as a 32 bit executable, the code I posted runs in like 184ms compared to the 2s when I compile as a 64 bit application.
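For anyone wanting to reproduce this: nvcc selects the target word size with -m32/-m64 (64-bit being the default on a 64-bit host), so the two builds are just:

```
nvcc -m32 -o demo32 main.cu    rem 32-bit executable, UVA off
nvcc -m64 -o demo64 main.cu    rem 64-bit executable (default), UVA on
```

Note that on Windows the -m32 build needs the 32-bit host toolchain available, so the vcvarsall setting has to match.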

That’s an interesting lead, maybe txbob has some ideas about that. I haven’t built any 32-bit code in years.

64 bit turns on UVA, 32 bit turns it off.

UVA has implications for initialization of the virtual address space.

I think this (initialization) may have gotten slower in Windows 10 as well. Many Win10 bugs are being fixed in the driver; if you’re not using the very latest driver for that GPU, a driver update may be worth a look.

Multiple GPUs in the system will affect this as will overall memory size.

This is all just hand-waving. Without a careful analysis I think any of these and/or JIT could still be a factor.

I think I’m having the same issue but on Windows 7 64-bit.

I’m using a total of six GPUs (two GTX 1070s and four GTX TITAN Xs) and my system has 256 GB of main memory. It takes about 0.8 seconds to create a single context, and if I try to create contexts in parallel on different threads, they seem to get serialized anyway. As a result it takes about 5 seconds to create all six contexts.

I don’t suppose there is any way to disable UVA in 64-bit applications?

I am reasonably sure that UVA is tightly integrated into CUDA and cannot be disabled. I would expect txbob to speak up if I am incorrect in that assessment.

I have never used a Windows system this large, but 0.8 seconds (= 800 milliseconds) for context creation sounds plausible. The overhead is correlated with host system performance: much of the context creation and initialization work presumably consists of non-parallelizable OS calls and therefore correlates with single-thread CPU performance, possibly also with system memory throughput.

That plays into my standing recommendation to use CPUs with modest core counts and very high single-thread performance (> 3.5 GHz base frequency) for systems running GPU-accelerated software.

I recently commented on this general topic (though I was specifically looking at the time to initialize the cuBLAS or cuFFT library).

https://devtalk.nvidia.com/default/topic/1022896/varying-cublas-initialization-times-across-different-os-configurations/#5204766

What I found is that this is not specific to Windows; in fact, Ubuntu 14.04 x64 LTS had a longer initialization time than Windows 8.1 x64. This was for an almost identical system with 64 GB of DDR4 system RAM and two GTX 1080 Ti GPUs, and the Ubuntu system actually has slightly better hardware.

If you are not using any of the CUDA libraries, then plain CUDA initialization is not too long; it is those libraries which really seem to have a long initialization time.

As njuffa mentioned, it also seems to be related to the amount of host system memory, and for Sam that 256 GB is probably the culprit.