Is it possible to do constexpr CUDA resource allocation?

At work, I am expected to complete just one task (a single kernel execution) faster than the cudaMalloc lazy-initialization overhead. Is there any way to reduce the cudaMalloc run-time overhead from 200 ms to maybe 1 ms? My workload is simple: allocate a few kB of free space with cudaMalloc, execute a kernel, then cudaFree. Everything works fine, but the first CUDA API call performs the lazy initialization and sometimes takes up to 500 ms.
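To illustrate, here is a minimal sketch of the pattern I mean (the kernel, buffer size, and timing code are just placeholders); the first CUDA runtime call in the process absorbs the lazy context creation, which is why the first cudaMalloc looks so slow:

```cpp
// Illustrative repro sketch: times the first cudaMalloc (which pays the
// lazy context initialization) separately from the kernel + cudaFree.
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    using clock = std::chrono::steady_clock;
    const int n = 1024;                      // a few kB, as in the question

    auto t0 = clock::now();
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));       // first CUDA call: pays context init
    auto t1 = clock::now();

    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    auto t2 = clock::now();

    printf("first cudaMalloc: %.1f ms, kernel + free: %.1f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count(),
           std::chrono::duration<double, std::milli>(t2 - t1).count());
    return 0;
}
```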

If there is no constexpr way of optimizing this, how can I tell the NVIDIA runtime (perhaps via environment variables?) not to allocate 64 MB of process memory? I am only using 4 kB of memory for a lot of calculations that take only a few milliseconds.

The problem occurs both on a Windows 11 machine with an RTX 4070 and on a Colab Ubuntu instance with a T4.

Thanks in advance.

There is no way around the initialization of the internal CUDA state.


To first order, the performance of the CUDA memory allocation functions has nothing to do with the GPU used; it is mainly a function of single-thread CPU performance and the operating system. On Windows with the WDDM driver in particular, GPU memory allocation is forced to go through the operating system.

CUDA context initialization time is a function of (1) the total memory in the system (host and GPUs combined), since this needs to be mapped into a unified address space, and (2) single-threaded CPU performance, with a minor influence from system memory performance.

To minimize CUDA initialization time, choose a bare-metal system (that is, one without virtualization) that has a frequency-optimized CPU with high memory bandwidth (e.g. AMD EPYC 9174F) running Linux, with as little GPU memory and system memory as you can get away with.

You can trigger CUDA context initialization at a place that is convenient to you by calling cudaFree(0) prior to any other CUDA activity.
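As a minimal sketch of that suggestion (the warmUpCuda function name is just for illustration), the one-time initialization cost can be absorbed at startup so the latency-sensitive path later only pays for the small allocation itself:

```cpp
// Sketch: trigger CUDA context creation early, outside the time-critical path.
#include <cuda_runtime.h>

void warmUpCuda()
{
    // cudaFree(0) allocates nothing, but it forces the lazy initialization
    // of the CUDA context on the current device.
    cudaFree(0);
}

int main()
{
    warmUpCuda();   // absorb the one-time context initialization cost here
    // ... later, the cudaMalloc / kernel / cudaFree sequence no longer
    //     includes the context creation overhead.
    return 0;
}
```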

GPUs are designed as throughput machines, while CPUs are primarily designed for low-latency processing, although in recent years there has been a trend to emphasize CPU throughput more than was historically the case. For latency-sensitive work (example: high-frequency trading), a CPU running at very high clock rates is often the most suitable platform. CPUs with boost clocks of up to 6 GHz are available at this time.
