I’m working on a reasonably large CUDA program (around 1700 lines of code), and I’ve been having some trouble with compile time and execution time. The foundation is a CUDA implementation of differential evolution that I wrote a few months ago, and I’ve been very happy with how it has performed so far. Recently, my supervisor asked me to swap out the old objective function (the function to be optimized) and see how the program handles a large optimization problem related to robot kinematics. This objective function is quite large and fairly complicated (it easily makes up most of the 1700 lines of code mentioned above), and I’ve been having trouble ever since I started implementing it.
Whereas my compile time was previously around a second, the program now takes about 100 seconds to compile with the new objective function (roughly 25 seconds in the “be” step and 75 seconds in “ptxas”). I don’t know whether this would be considered reasonable, but it seems high to me. If anyone knows more about the nvcc compilation process than I do, can you provide some insight?
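For context, I’m compiling with a plain nvcc invocation along these lines (the file and output names here are just placeholders, not my actual project layout), and the per-phase numbers above come from watching the sub-commands that nvcc’s verbose output lists:

    nvcc -v devo.cu -o devo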
Second (and more importantly), memory allocation in my host code has become extremely slow. In particular, it’s the first call to cudaMalloc that is slowing everything down. I’m aware that the first CUDA runtime call lazily initializes the device context if that hasn’t already been done, but this seems excessive to me. Previously I’ve measured the first cudaMalloc call at about 150-200 milliseconds, but now it takes over 130 seconds (several hundred times longer than what I’ve previously seen). I’m not allocating any large amount of memory (only a few MB for my test trials), so this seems kind of ridiculous to me. Does anyone know what might be causing this? What exactly is going on during that initialization, and how can I reduce the time it takes?
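In case it’s useful, here’s a minimal sketch of the kind of timing I’m doing (simplified from my actual host code, and it assumes a C++11-capable host compiler; the allocation size is just the few-MB test case I mentioned):

    #include <cstdio>
    #include <chrono>
    #include <cuda_runtime.h>

    int main() {
        using clk = std::chrono::steady_clock;

        // First CUDA runtime call: this is what triggers context creation/initialization.
        clk::time_point t0 = clk::now();
        void *p = 0;
        cudaError_t err = cudaMalloc(&p, 4 << 20);   // a few MB, as in my test trials
        clk::time_point t1 = clk::now();
        if (err != cudaSuccess)
            printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));

        // Second allocation for comparison, with the context already created.
        void *q = 0;
        cudaMalloc(&q, 4 << 20);
        clk::time_point t2 = clk::now();

        printf("first  cudaMalloc: %.1f ms\n",
               std::chrono::duration<double, std::milli>(t1 - t0).count());
        printf("second cudaMalloc: %.1f ms\n",
               std::chrono::duration<double, std::milli>(t2 - t1).count());

        cudaFree(p);
        cudaFree(q);
        return 0;
    }

The first number is the one that has blown up from roughly 150-200 ms to over 130 seconds; subsequent allocations are still fast.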
I hesitate to post my code right now because there’s so much of it, but I’d be open to doing that if necessary.