cudaMalloc taking very long

I’m using a GPU-GMRES program to process very large sparse matrices.

For context, I’m working with a 180k by 180k matrix with 5.6 million non-zero entries.

This program takes a long time to compile, as it has to link against many libraries, and when it runs, the first cudaMalloc takes 212 seconds altogether. My research suggests that this is due to three factors:

  1. Driver Initialization
  2. PTX compilation
  3. Context creation

I wonder whether there are any other reasons that could cause this.

Nonetheless, is there any way to overcome this issue?

Thank you.

Your three-item list is a good summary.

(1) Driver initialization: Make sure you make the driver persistent, if you haven’t done that already. Long driver initialization times are frequently seen on systems with very large memory (both CPU and GPU), as the driver needs to map all GPU and system memory into a single virtual memory map.

(2) PTX compilation: Best practice is to create fat binaries that embed SASS for all architectures you intend to support, plus one PTX version for the latest architecture supported by CUDA (for forward compatibility with future GPU architectures). Dynamic PTX generation and compilation should be used only when absolutely necessary, e.g. some in-memory databases for GPUs compile queries into custom kernels created on the fly.
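As a sketch of that best practice, a fat-binary build line for a hypothetical `kernel.cu` targeting Kepler and Maxwell plus forward-compatible PTX might look like this (file and output names are placeholders):

```shell
# Embed SASS for sm_35 (Kepler) and sm_50 (Maxwell), plus PTX for
# compute_50 so future architectures can JIT-compile it at load time.
nvcc -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_50,code=sm_50 \
     -gencode arch=compute_50,code=compute_50 \
     -o app kernel.cu
```

With SASS embedded for the GPU actually in the machine, no JIT compilation happens at startup; the PTX entry is only used as a fallback on newer hardware.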

Note that most of the overhead enumerated in your list is CPU (host-side) work, and much of it is single-threaded to boot. For high-performance systems using GPU acceleration, I therefore recommend CPUs with high single-thread performance. At this time, I’d say that means base frequency >= 3.5 GHz.

Thank you for your reply, Njuffa.

  1. I’m running on Windows with Visual Studio 2013. How do I put it into persistence mode? I read up on that issue and what I found was this article from Nvidia:

“On Windows the kernel mode driver is loaded at Windows startup and kept loaded until Windows shutdown.” I haven’t used persistence mode before, so I may have missed something.

  2. I’ll look into it. Thanks for the tip.

Thank you again and I look forward to your reply for part 1.

If you are on Windows, the driver is automatically persistent, as the documentation states. On Linux, the driver will be unloaded when not in use unless placed in persistent mode. I had incorrectly assumed that you are on Linux.
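For Linux readers hitting the same symptom, enabling driver persistence is done through nvidia-smi; this is a sketch of the standard commands (root is required to change the setting, and it only applies on Linux):

```shell
# Check the current persistence mode setting
nvidia-smi --query-gpu=persistence_mode --format=csv

# Enable persistence mode so the driver stays loaded between CUDA runs
sudo nvidia-smi -pm 1
```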

How big is your memory? 212 seconds is extraordinarily long, even for machines with tons of memory. Are you running Windows 10? My best guess at this time is JIT overhead.

In practical terms, it might help to trigger the startup overhead at a more convenient point in the application by calling cudaFree(0).
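A minimal sketch of that warm-up pattern, assuming the CUDA runtime API (the timing code is just for illustration and requires a CUDA-capable machine to run):

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    auto t0 = std::chrono::steady_clock::now();
    // cudaFree(0) forces driver initialization, context creation, and any
    // JIT compilation to happen here, at a convenient point, instead of
    // silently inflating the first "real" allocation.
    cudaFree(0);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("CUDA startup took %.1f s\n",
                std::chrono::duration<double>(t1 - t0).count());

    // Later allocations no longer pay the one-time startup cost.
    void *buf = nullptr;
    cudaMalloc(&buf, 1 << 20);
    cudaFree(buf);
    return 0;
}
```

Note that this only moves the overhead earlier; the total startup cost is unchanged.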

I have 16 GB of RAM and am running with 6 Intel CPU cores and a Tesla K40c GPU. I concur that it’s the JIT overhead, as I’ve read about it too. It most likely has to do with the sheer amount of compilation for the libraries linked to it?

As regards cudaFree(0), I’ve tested it and agree it might be a quick fix, but it doesn’t really address the problem: the total time is still longer when the GPU version is pitted against CPU timings for the same problem.

If you have any advice on overcoming the JIT compilation, I’m willing to learn. Thanks again!

Yeah, with only 16 GB of system memory and one GPU, it is definitely not a delay from mapping memory.

The CUDA documentation (the compiler guide in particular) explains everything you need to know about creating fat binaries for multiple GPU architecture targets.

For now, simply try building for your specific GPU (K40c) by passing the command-line flag -arch=sm_35 to nvcc. If you do not specify an architecture on the nvcc command line, the compiler defaults to an sm_20 build target. The code then needs to be JIT-compiled for sm_35 at startup.
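As a sketch, assuming a source file named `app.cu` and the CUDA toolkit’s cuobjdump tool on the PATH, you can both build for the K40c and confirm what actually ended up in the binary:

```shell
# Build SASS for the K40c's architecture so no JIT is needed at startup
nvcc -arch=sm_35 -o app app.cu

# List the embedded SASS (ELF) and PTX images in the executable
cuobjdump --list-elf app
cuobjdump --list-ptx app
```

If the list shows an sm_35 ELF image, the startup JIT step is skipped on that GPU.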

Excellent: passing the command-line flag -arch=sm_35 to nvcc does overcome the JIT issue. This is something new to me: why does passing the architecture specification make such a difference, and why can’t everything be aligned to the same architecture?

Please bear with me, as I still have some finer details of CUDA programming to learn. Thank you for your help.

It’s not an x86 world: each NVIDIA GPU generation has its own architecture, slightly incompatible with the others, so you need to compile for a specific GPU generation. An executable usually includes final code for a few architectures (at your discretion) plus generic higher-level code that is JIT-compiled for the proper architecture if no precompiled code fits it.

I see, thanks for the response. I’ll keep that in mind.

Just to clarify: the differences between some GPU architectures are fairly minor, while there are major differences between others, e.g. sm_20 vs. sm_30 and sm_30 vs. sm_50.

Side question - may not be relevant.
Is your persistence mode set to 1?

If I were on Linux, I’d have done it. But since I’m working on Windows, it isn’t applicable… my main problem was that the architecture was not aligned, which caused the delay in execution.