cudaMalloc takes a very long time on the first call. How to overcome this issue?

kunasiramesh · April 12, 2021, 11:58am

My complete code takes 100MHz on the CPU, and I want to optimize it further using CUDA. I only added a sample device memory allocation with cudaMalloc, and because of this the code now takes 1.1GHz. How can I avoid this overhead on the first cudaMalloc call?

njuffa · April 12, 2021, 7:28pm

CUDA uses lazy initialization of the CUDA context. That typically means that the first CUDA API call in a program will incur the context initialization overhead. Often this is a call to cudaMalloc(). If you want this startup cost to be incurred at some other place, issue a call to cudaFree(0) at that point.
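
As a rough illustration of the cudaFree(0) approach, here is a minimal sketch that forces context creation up front and then times the first cudaMalloc() separately. The helper name elapsed_ms and the 1 MiB buffer size are my own choices for the example, not anything from the original code, and the timings it prints will differ from system to system.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: wall-clock milliseconds spent inside a callable.
template <typename F>
double elapsed_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    // Trigger the lazy context initialization here, at a point we choose.
    cudaError_t init_status = cudaSuccess;
    double init_ms = elapsed_ms([&] { init_status = cudaFree(0); });
    if (init_status != cudaSuccess) {
        std::fprintf(stderr, "CUDA init failed: %s\n",
                     cudaGetErrorString(init_status));
        return 1;
    }

    // The context already exists, so this call should only pay for the
    // allocation itself (1 MiB is an arbitrary size for the example).
    void* d_buf = nullptr;
    cudaError_t alloc_status = cudaSuccess;
    double alloc_ms = elapsed_ms(
        [&] { alloc_status = cudaMalloc(&d_buf, 1 << 20); });
    if (alloc_status != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n",
                     cudaGetErrorString(alloc_status));
        return 1;
    }

    std::printf("context init (cudaFree(0)): %.3f ms\n", init_ms);
    std::printf("first cudaMalloc:           %.3f ms\n", alloc_ms);

    cudaFree(d_buf);
    return 0;
}
```

Note that this only moves the startup cost to a different place in the program; it does not make it smaller.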

One of the activities that occurs during initialization that has significant cost is to map all system memory and all GPU memory into a single unified virtual address space. The more total (CPU + GPU) memory your system has, the longer this will take. This activity consists almost entirely of single-threaded operating system activity, i.e. this is part of the serial portion of your application. Serial host code benefits from high single-thread CPU performance and fast system memory. I recommend a CPU with base frequency of > 3.5 GHz. Use the fastest speed grade of DDR4 your system is designed for and populate as many DDR4 channels as possible (e.g. high-end workstation: 4 channels, high-end server: 8 channels).

On a fast system as described, a very rough approximation is that 4 milliseconds of initialization time accrue per GB of memory mapped. You may be able to reduce the time needed for mapping of GPU memory by using the environment variable CUDA_VISIBLE_DEVICES to hide some GPUs from CUDA.
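
To make the rule of thumb concrete, here is a small host-only sketch of the arithmetic. The 4 ms per GB figure is the very rough approximation from above; the system configuration (64 GB of system memory, two GPUs with 24 GB each) and the function name estimate_init_ms are made up for the example.

```cuda
#include <cstdio>

// Very rough approximation from the text: about 4 ms per GB of memory mapped.
constexpr double kMsPerGb = 4.0;

// Hypothetical helper: estimated initialization time in milliseconds for a
// system with host_gb of system memory and visible_gpus GPUs of gpu_gb each.
constexpr double estimate_init_ms(double host_gb, double gpu_gb,
                                  int visible_gpus) {
    return kMsPerGb * (host_gb + gpu_gb * visible_gpus);
}

int main() {
    // Assumed example system: 64 GB system memory, two 24 GB GPUs.
    std::printf("both GPUs visible: ~%.0f ms\n",
                estimate_init_ms(64.0, 24.0, 2));  // 4 * (64 + 48) = 448
    // Same system with one GPU hidden via CUDA_VISIBLE_DEVICES.
    std::printf("one GPU visible:   ~%.0f ms\n",
                estimate_init_ms(64.0, 24.0, 1));  // 4 * (64 + 24) = 352
    return 0;
}
```

Treat the result as an order-of-magnitude estimate only; slower CPUs or slower system memory will take longer.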

If you are on Linux, when CUDA is not in use, the operating system will unload the driver. When you start a CUDA-accelerated application, there will be additional overhead to re-load the driver. Enable the persistence daemon to keep the driver loaded at all times. Relevant documentation can be found here:

Driver Persistence :: GPU Deployment and Management Documentation (nvidia.com)
