My complete code takes 100 MHz on the CPU, and I want to optimize it further using CUDA programming. I only allocated some sample memory on the device using cudaMalloc, and because of this the code now takes 1.1 GHz. How can I overcome this overhead on the first call to cudaMalloc?
CUDA uses lazy initialization of the CUDA context. That typically means that the first CUDA API call in a program will incur context initialization overhead. Often this is a call to cudaMalloc(). If you want this startup cost to be incurred at some other place, issue a call to cudaFree(0) at that point.
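To see where the time goes, here is a minimal sketch that forces context creation with cudaFree(0) and then times a later cudaMalloc() separately. The time_ms helper and the 1 MiB buffer size are my own choices for illustration, not anything from your code. It needs the CUDA toolkit and a GPU to build and run.

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: wall-clock time of a callable, in milliseconds.
template <typename F>
static double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    cudaError_t err = cudaSuccess;

    // First CUDA API call: this is where lazy context initialization happens.
    double init_ms = time_ms([&] { err = cudaFree(0); });
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFree(0) failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // The context already exists, so this measures the allocation alone.
    void* d_buf = nullptr;
    const size_t bytes = 1 << 20;  // 1 MiB, arbitrary sample size
    double malloc_ms = time_ms([&] { err = cudaMalloc(&d_buf, bytes); });
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("context init via cudaFree(0): %.3f ms\n", init_ms);
    std::printf("subsequent cudaMalloc:        %.3f ms\n", malloc_ms);

    cudaFree(d_buf);
    return 0;
}
```

If the first number dominates, the cost you are seeing is context initialization rather than the allocation itself, and moving the cudaFree(0) call to program startup moves that cost out of the code you are timing.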
One of the activities that occurs during initialization that has significant cost is to map all system memory and all GPU memory into a single unified virtual address space. The more total (CPU + GPU) memory your system has, the longer this will take. This activity consists almost entirely of single-threaded operating system activity, i.e. this is part of the serial portion of your application. Serial host code benefits from high single-thread CPU performance and fast system memory. I recommend a CPU with base frequency of > 3.5 GHz. Use the fastest speed grade of DDR4 your system is designed for and populate as many DDR4 channels as possible (e.g. high-end workstation: 4 channels, high-end server: 8 channels).
On a fast system as described, a very rough approximation is that 4 milliseconds of initialization time accrue per GB of memory mapped. You may be able to reduce the time needed for mapping of GPU memory by using the environment variable CUDA_VISIBLE_DEVICES to hide some GPUs from CUDA.
If you are on Linux, when CUDA is not in use, the operating system will unload the driver. When you start a CUDA-accelerated application, there will be additional overhead to re-load the driver. Enable the persistence daemon to keep the driver loaded at all times. Relevant documentation can be found here: