Slow compile and cudaMalloc

bwood · January 29, 2011, 7:14pm

I’m working on a reasonably large CUDA program (around 1700 lines of code), and I’ve been having some troubles with compile time and execution time. The foundation is a CUDA implementation of differential evolution that I wrote a few months ago and I’ve been very happy with how it has performed so far. Recently, my supervisor asked me to swap out the old objective function (the function to be optimized) and see how the program works with a large optimization problem related to robot kinematics. This objective function is quite large and fairly complicated (easily makes up most of the 1700 lines of code mentioned above), and I’ve been having troubles ever since I started implementing it.

Whereas my compile time previously was around a second, it now takes about 100 seconds to compile with the new objective function (about 25 seconds on the “be” command, and 75 seconds on “ptxas”). I don’t know if this would be considered reasonable, but it seems a little high to me. If anyone knows more about the compilation process than me, can you provide some insight?

Second (and more importantly), the memory allocation aspect of my host code is extremely slow. In particular, it’s the first call to cudaMalloc that is slowing everything down. I’m aware that the first call to cudaMalloc does some initialization of the GPU if it hasn’t already been done, but this seems excessive to me. Previously I’ve measured the first cudaMalloc call to take about 150-200 milliseconds, but now it takes over 130 seconds (about 1000 times longer than what I’ve previously seen). I’m not allocating a large amount of memory (only a few MB for my test trials), so this seems kind of ridiculous to me. Does anyone know what might be causing this? What exactly is going on during the initialization and how can I reduce the time it takes?

I hesitate to post my code right now because there’s so much of it, but I’d be open to doing that if necessary.

njuffa · January 29, 2011, 8:02pm

Given that both the “be” and “ptxas” execution times are quite lengthy, my best guess is that your original code is expanded into a much lengthier intermediate version, due to either loop unrolling or function inlining. Most of CUDA’s math functions and intrinsics will be expanded inline, for example. You could use the -keep option to dump the intermediate files to check how many lines of PTX your code expands to (the .ptx file). If you find that the generated code is fairly small (a few thousand lines of PTX) the long compile times are probably something we should look into.

The compiler is pretty much compute bound unless file access is very slow, so in general compilation times are strongly correlated to the speed of the system you are compiling on.

bwood · January 30, 2011, 2:57am

Thanks for the response, njuffa. The .ptx file is over 100k lines, so I guess that answers my question about the long compile time! Do you know anything about the initialization time for CUDA devices? I mean, is there any relationship between the size of the .ptx file and how long it would take to do that first initialization when executing the program? As it stands, my program spends about 99% of its time trying to execute the first cudaMalloc function, and the remaining 1% actually doing the calculations I want it to do. Alternatively, is there any documentation available that goes into detail about the initialization process?

cudesnick · February 1, 2011, 1:46am

These cudaMalloc times are much longer than anything I’ve ever seen. If I were you, I’d start paring your code down to the shortest possible version that still spends as much time in cudaMalloc.
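A minimal sketch of how such a reduced test case could time a single call, so that the first CUDA call can be measured separately from everything after it. The names time_call_ms and noop_call are hypothetical and not from the thread; noop_call is only a stand-in so the helper can be compiled and tried without a GPU.

```c
#include <time.h>

/* Hypothetical helper (not from the thread): wall-clock duration of one
   call to fn, in milliseconds. timespec_get (C11) reads calendar time, so a
   system clock adjustment during the call would distort the result. */
double time_call_ms(void (*fn)(void))
{
    struct timespec t0, t1;
    timespec_get(&t0, TIME_UTC);
    fn();
    timespec_get(&t1, TIME_UTC);
    return (double)(t1.tv_sec - t0.tv_sec) * 1e3
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
}

/* Stand-in for the call being measured; it does nothing. */
void noop_call(void)
{
}
```

In a .cu file built with nvcc, noop_call could be swapped for a function whose body is cudaFree(0), a common idiom for forcing context creation, and then for one containing the first cudaMalloc. Comparing the two numbers would show whether the delay belongs to initialization or to the allocation itself.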

My vague memory of CUDA compile and run-time architecture is as follows: you use nvcc to compile your kernel code into an intermediate device-independent form (PTX). When you execute your program, this intermediate device code gets compiled by the driver into its final binary form and that final code gets executed on the device. Since your code takes so long to compile, maybe that second-stage compilation causes the delay you’re seeing. As far as I understand, you can get nvcc to compile the code all the way into the final binary form by using the -arch option to nvcc.

Good luck!

seibert · February 1, 2011, 3:40pm

Regarding the slow time for cudaMalloc(): A number of people have noticed that recent NVIDIA drivers have a very slow CUDA initialization time if the card has been idle. You can mitigate this problem by running nvidia-smi in the background to keep the card state from being unloaded. This post gives the command line:

http://forums.nvidia.com/index.php?showtopic=185733&view=findpost&p=1152124

bwood · February 2, 2011, 3:48pm

Yup, that took care of it! Just to reiterate what I’ve gotten out of this, using a command like this:

nvcc -o main cudaDE.cu

doesn’t necessarily compile to a full binary to run on the GPU. If it doesn’t, then the full binary will be compiled at runtime, which would clearly take a long time for such a large amount of PTX code. That’s what was happening here, and I fixed it by specifying the architecture for which to compile:

nvcc -arch=sm_20 -o main cudaDE.cu

The “sm_20” means that it should be compiled for a CUDA device with compute capability 2.0. Thanks for your help, everyone!

It’s important to me that this code is able to run properly on different systems with different GPUs, so I made a little program to get the CUDA capability of the local device, which can be used with a Makefile to ensure that the appropriate architecture is used for compilation. I named this file arch_check.cu, and I keep it in the same directory as the rest of my code.

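#include <stdio.h>
#include <stdlib.h>        /* exit, EXIT_FAILURE */
#include <cuda_runtime.h>  /* runtime API: cudaGetDeviceCount, cudaGetDeviceProperties */

int main(void)
{
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed; the CUDA driver and runtime versions may be mismatched.\n");
        exit(EXIT_FAILURE);
    }
    if (deviceCount == 0) {
        fprintf(stderr, "There is no device supporting CUDA\n");
        exit(EXIT_FAILURE);
    }

    /* Print one flag per device, separated by spaces so that the output
       for several devices does not run together. */
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp deviceProp;
        if (cudaGetDeviceProperties(&deviceProp, dev) != cudaSuccess)
            continue;  /* skip devices whose properties cannot be read */
        fprintf(stdout, "-arch=sm_%d%d ", deviceProp.major, deviceProp.minor);
    }
    return 0;
}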

My Makefile looks like this:

all:
	nvcc $(shell ./arch_check) -o main cudaDE.cu

arch_check:
	nvcc -o arch_check arch_check.cu

clean:
	rm -f *.o main arch_check

In order to compile the code, I first run “make arch_check”, followed by “make”. The first command compiles arch_check.cu so that the device properties can be read, and the second command runs the arch_check executable and the results specify the architecture for which to compile the main program.

njuffa · February 2, 2011, 10:30pm

To build the maximum flexibility into a fat binary one wants pre-built machine code for all architectures of interest plus the PTX needed for JITing to future architectures. The way to achieve this is with one -gencode switch per target architecture, configured to generate both PTX and machine code. For example, to build a double precision application for sm_13, sm_20, and sm_21 in this fashion, one would use

-gencode arch=compute_13,\"code=sm_13,compute_13\" -gencode arch=compute_20,\"code=sm_20,compute_20\" -gencode arch=compute_21,\"code=sm_21,compute_21\"
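As a rough sketch of the same idea, the switches for one target can be generated from its virtual and real architecture numbers. The helper name gencode_flags is hypothetical, and the sketch assumes the form with two separate -gencode switches per target (one with code=sm_XX for machine code, one with code=compute_XX for embedded PTX) rather than the quoted list form shown above.

```c
#include <stdio.h>

/* Hypothetical helper (not from the thread): format the nvcc switches that
   embed both machine code (code=sm_<real>) and PTX (code=compute_<virt>)
   for one target. Architectures are given as two-digit numbers, e.g. 20 for
   compute capability 2.0. Returns the snprintf result: the length needed,
   not counting the terminating NUL. */
int gencode_flags(char *out, size_t cap, int virt, int real)
{
    return snprintf(out, cap,
                    "-gencode arch=compute_%d,code=sm_%d "
                    "-gencode arch=compute_%d,code=compute_%d",
                    virt, real, virt, virt);
}
```

Calling gencode_flags(buf, sizeof buf, 20, 20) fills buf with "-gencode arch=compute_20,code=sm_20 -gencode arch=compute_20,code=compute_20". Keeping the virtual and real numbers as separate parameters leaves room for targets where the two differ.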

cudesnick · February 2, 2011, 10:30pm

Glad it worked out for you.

I don’t think this extra architecture check is necessary. You can supply several different arch flags in your nvcc command line at once, and you’ll have several versions of the GPU code, one per architecture, compiled into your Linux executable. My nvcc command line contains:

-gencode arch=compute_13,code=sm_13 -gencode arch=compute_20,code=sm_20

This seems to be a simpler solution. The compilation time nearly doubles, of course. Also, note that the build machine is not necessarily the run machine. In that case you cannot possibly figure out the architecture at compile time.

cudesnick · February 2, 2011, 10:32pm

Ha! Our two posts are quite similar and have been submitted less than a minute apart!
