Is it possible to do constexpr CUDA resource allocation?

At work, I am expected to complete just one task (a single kernel execution) faster than the cudaMalloc lazy-initialization overhead. Is there any way to reduce the cudaMalloc run-time overhead from 200 ms to maybe 1 ms? My workload is simple: allocate a few kB of free space with cudaMalloc, execute a kernel, then cudaFree. Everything works fine, but the first CUDA API call performs the lazy initialization and sometimes takes up to 500 ms.
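To illustrate, here is a minimal sketch of the pattern I mean (the kernel, buffer size, and timing code are just placeholders); the first CUDA runtime call in the process absorbs the lazy context creation, which is why the first cudaMalloc looks so slow:

```cpp
// Illustrative repro sketch: times the first cudaMalloc (which pays the
// lazy context initialization) separately from the kernel + cudaFree.
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    using clock = std::chrono::steady_clock;
    const int n = 1024;                      // a few kB, as in the question

    auto t0 = clock::now();
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));       // first CUDA call: pays context init
    auto t1 = clock::now();

    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    auto t2 = clock::now();

    printf("first cudaMalloc: %.1f ms, kernel + free: %.1f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count(),
           std::chrono::duration<double, std::milli>(t2 - t1).count());
    return 0;
}
```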

If there is no constexpr way of optimizing this, how can I tell the NVIDIA runtime (perhaps via environment variables?) not to allocate 64 MB of process memory? I am only using 4 kB of memory for a lot of calculations that take only a few milliseconds.

The problem occurs both on a Windows 11 machine with an RTX 4070 and on a Colab Ubuntu instance with a T4.

Thanks in advance.

There is no way around the initialization of the internal CUDA state.


To first order, the performance of the CUDA memory allocation functions has nothing to do with the GPU used; it is mainly a function of single-thread CPU performance and the operating system. On Windows with the WDDM driver in particular, GPU memory allocation is forced to go through the operating system.

CUDA context initialization time is a function of (1) the total memory in the system (host and GPUs combined), since this needs to be mapped into a unified address space, and (2) single-threaded CPU performance, with a minor influence from system memory performance.

To minimize CUDA initialization time, choose a bare-metal system (that is, one without virtualization) that has a frequency-optimized CPU with high memory bandwidth (e.g. AMD EPYC 9174F) running Linux, with as little GPU memory and system memory as you can get away with.

You can trigger CUDA context initialization at a place that is convenient to you by calling cudaFree(0) prior to any other CUDA activity.
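As a minimal sketch of that suggestion (the warmUpCuda function name is just for illustration), the one-time initialization cost can be absorbed at startup so the latency-sensitive path later only pays for the small allocation itself:

```cpp
// Sketch: trigger CUDA context creation early, outside the time-critical path.
#include <cuda_runtime.h>

void warmUpCuda()
{
    // cudaFree(0) allocates nothing, but it forces the lazy initialization
    // of the CUDA context on the current device.
    cudaFree(0);
}

int main()
{
    warmUpCuda();   // absorb the one-time context initialization cost here
    // ... later, the cudaMalloc / kernel / cudaFree sequence no longer
    //     includes the context creation overhead.
    return 0;
}
```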

GPUs are designed as throughput machines, while CPUs are primarily designed for low-latency processing, although in recent years there has been a trend to emphasize CPU throughput more than was historically the case. For latency-sensitive work (example: high-frequency trading), a CPU running at very high clock rates is often the most suitable platform. CPUs with boost clocks of up to 6 GHz are available at this time.
