Slow compile and cudaMalloc

I’m working on a reasonably large CUDA program (around 1700 lines of code), and I’ve been having some troubles with compile time and execution time. The foundation is a CUDA implementation of differential evolution that I wrote a few months ago and I’ve been very happy with how it has performed so far. Recently, my supervisor asked me to swap out the old objective function (the function to be optimized) and see how the program works with a large optimization problem related to robot kinematics. This objective function is quite large and fairly complicated (easily makes up most of the 1700 lines of code mentioned above), and I’ve been having troubles ever since I started implementing it.

Whereas my compile time previously was around a second, it now takes about 100 seconds to compile with the new objective function (about 25 seconds on the “be” command, and 75 seconds on “ptxas”). I don’t know if this would be considered reasonable, but it seems a little high to me. If anyone knows more about the compilation process than me, can you provide some insight?

Second (and more importantly), the memory allocation aspect of my host code is extremely slow. In particular, it’s the first call to cudaMalloc that is slowing everything down. I’m aware that the first call to cudaMalloc does some initialization of the GPU if it hasn’t already been done, but this seems excessive to me. Previously I’ve measured the first cudaMalloc call to take about 150-200 milliseconds, but now it takes over 130 seconds (about 1000 times longer than what I’ve previously seen). I’m not allocating any large amount of memory (only a few MB for my test trials), but this seems kind of ridiculous to me. Does anyone know what might be causing this? What exactly is going on during the initialization and how can I reduce the time it takes?

I hesitate to post my code right now because there’s so much of it, but I’d be open to doing that if necessary.

Given that both the “be” and “ptxas” execution times are quite lengthy, my best guess is that your original code is expanded into a much lengthier intermediate version, due to either loop unrolling or function inlining. Most of CUDA’s math functions and intrinsics will be expanded inline, for example. You could use the -keep option to dump the intermediate files to check how many lines of PTX your code expands to (the .ptx file). If you find that the generated code is fairly small (a few thousand lines of PTX) the long compile times are probably something we should look into.
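Once -keep has left a .ptx file behind, any line counter will show its size. As a rough sketch (count_ptx_lines is a hypothetical helper, not part of the CUDA toolkit), the function below counts only lines that are neither blank nor pure // comments, which tracks the amount of generated code a little more closely than a raw line count:

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of the CUDA toolkit): count the lines of a
// PTX listing that are neither blank nor pure "//" comments, as a rough
// measure of how much code the compiler generated.
std::size_t count_ptx_lines(const std::string &ptx)
{
    std::istringstream in(ptx);
    std::string line;
    std::size_t n = 0;
    while (std::getline(in, line)) {
        std::size_t first = line.find_first_not_of(" \t\r");
        if (first == std::string::npos) continue;         // blank line
        if (line.compare(first, 2, "//") == 0) continue;  // comment-only line
        ++n;
    }
    return n;
}
```

Reading the kept .ptx file into a std::string and passing it to this function gives a number that can be compared against the "few thousand lines" rule of thumb above.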

The compiler is pretty much compute bound unless file access is very slow, so in general compilation times are strongly correlated with the speed of the system you are compiling on.

Thanks for the response, njuffa. The .ptx file is over 100k lines, so I guess that answers my question about the long compile time! Do you know anything about the initialization time for CUDA devices? I mean, is there any relationship between the size of the .ptx file and how long it would take to do that first initialization when executing the program? As it stands, my program spends about 99% of its time trying to execute the first cudaMalloc function, and the remaining 1% actually doing the calculations I want it to do. Alternatively, is there any documentation available that goes into detail about the initialization process?

These cudaMalloc times are much longer than anything I’ve ever seen. If I were you, I’d start dissecting your code to obtain the shortest possible version that still spends as much time malloc’ing.
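A possible starting point for that dissection is to time CUDA initialization separately from the allocation itself. The sketch below is an illustration, not code from this thread: time_ms is a hypothetical helper, and calling cudaFree(0) to force initialization is a common idiom rather than anything the original program does. The CUDA part is only compiled when the file is built with nvcc.

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical helper (not from the original program): run f once and
// return the elapsed wall-clock time in milliseconds.
template <typename F>
double time_ms(F f)
{
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

#ifdef __CUDACC__   // everything below is built only when compiling with nvcc
#include <cuda_runtime.h>

int main()
{
    // cudaFree(0) allocates nothing; it is commonly used to trigger the
    // runtime's one-time initialization so that cost can be measured alone.
    double init = time_ms([] { cudaFree(0); });

    // With initialization already done, this measures the allocation itself.
    void *buf = nullptr;
    double alloc = time_ms([&] { cudaMalloc(&buf, 4 << 20); });  // 4 MB

    std::printf("init: %.1f ms, cudaMalloc: %.1f ms\n", init, alloc);
    cudaFree(buf);
    return 0;
}
#endif
```

If the first number dominates, the delay comes from initialization rather than from the allocator, and shrinking the allocation size will not help.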

My vague memory of CUDA compile and run-time architecture is as follows: you use nvcc to compile your kernel code into an intermediate device-independent form. When you execute your program, this intermediate device code gets compiled by the driver into its final binary form and that final code gets executed on the device. Since your code takes so long to compile, maybe that second-stage compilation causes the delay you’re seeing. As far as I understand, you can get nvcc to compile the code all the way into the final binary form by passing the -arch option to nvcc.

Good luck!

Regarding the slow time for cudaMalloc(): A number of people have noticed that recent NVIDIA drivers have a very slow CUDA initialization time if the card has been idle. You can mitigate this problem by running nvidia-smi in the background to keep the card state from being unloaded. This post gives the command line:

Yup, that took care of it! Just to reiterate what I’ve gotten out of this, using a command like this:

nvcc -o main

doesn’t necessarily compile to a full binary to run on the GPU. If it doesn’t, then the full binary will be compiled at runtime, which would clearly take a long time for such a large amount of PTX code. That’s what was happening here, and I fixed it by specifying the architecture for which to compile:

nvcc -arch=sm_20 -o main

The “sm_20” means that it should be compiled for a CUDA device with compute capability 2.0. Thanks for your help, everyone!

It’s important to me that this code is able to run properly on different systems with different GPUs, so I made a little program to get the CUDA compute capability of the local device, which can be used with a Makefile to ensure that the appropriate architecture is used for compilation. The resulting executable is called arch_check, and I keep its source file in the same directory as the rest of my code.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int deviceCount = 0;

    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount FAILED CUDA Driver and Runtime version may be mismatched.\n");
        return 1;
    }

    if (deviceCount == 0) {
        fprintf(stderr, "There is no device supporting CUDA\n");
        return 1;
    }

    /* Report device 0 only, so that exactly one -arch flag is printed. */
    cudaDeviceProp deviceProp;
    if (cudaGetDeviceProperties(&deviceProp, 0) != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties FAILED\n");
        return 1;
    }

    /* No trailing newline: the output is pasted into the nvcc command line. */
    fprintf(stdout, "-arch=sm_%d%d", deviceProp.major, deviceProp.minor);
    return 0;
}

My Makefile looks like this:


all:
	nvcc $(shell ./arch_check) -o main

arch_check:
	nvcc -o arch_check

clean:
	rm -f *.o main arch_check

In order to compile the code, I first run “make arch_check”, followed by “make”. The first command compiles the arch_check program so that the device properties can be read; the second runs the arch_check executable and uses its output to select the architecture for which the main program is compiled.

To build the maximum flexibility into a fat binary one wants pre-built machine code for all architectures of interest plus the PTX needed for JITing to future architectures. The way to achieve this is with one -gencode switch per target architecture, configured to generate both PTX and machine code. For example, to build a double precision application for sm_13, sm_20, and sm_21 in this fashion, one would use

-gencode arch=compute_13,\"code=sm_13,compute_13\" -gencode arch=compute_20,\"code=sm_20,compute_20\" -gencode arch=compute_21,\"code=sm_21,compute_21\"

Glad it worked out for you.

I don’t think this extra architecture check is necessary. You can supply several different arch flags in your nvcc command line at once, and you’ll have several versions of the GPU code, one per architecture, compiled into your Linux executable. My nvcc command line contains:

-gencode arch=compute_13,code=sm_13 -gencode arch=compute_20,code=sm_20

This seems to be a simpler solution. The compilation time nearly doubles, of course. Also, note that the build machine is not necessarily the run machine. In that case you cannot possibly determine the run machine’s architecture at compile time.
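For anyone who would rather generate these switches than type them, here is a minimal sketch of a flag builder. gencode_flags is a hypothetical helper, not part of nvcc or of anyone's code in this thread, and it does no validation, so the caller must pass architecture numbers that nvcc actually accepts. With embed_ptx set it also adds PTX for the newest listed target as a separate -gencode entry, which is a variation on the combined form shown earlier.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of nvcc): build one -gencode switch per
// target so the executable carries machine code for every listed
// architecture. With embed_ptx set, PTX for the newest listed target is
// added as well, so the driver can JIT-compile for GPUs newer than any
// listed. No validation: the caller must pass values that nvcc accepts.
std::string gencode_flags(const std::vector<int> &archs, bool embed_ptx)
{
    std::string flags;
    int newest = 0;
    for (int a : archs) {
        std::string n = std::to_string(a);
        if (!flags.empty()) flags += " ";
        flags += "-gencode arch=compute_" + n + ",code=sm_" + n;
        if (a > newest) newest = a;
    }
    if (embed_ptx && !archs.empty()) {
        std::string n = std::to_string(newest);
        flags += " -gencode arch=compute_" + n + ",code=compute_" + n;
    }
    return flags;
}
```

A small program that prints gencode_flags({13, 20}, true) could be called from the Makefile through $(shell ...), the same way arch_check is, but without needing a GPU on the build machine.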

Ha! Our two posts are quite similar and have been submitted less than a minute apart!