CUDA initialization very slow on GeForce GTX 465
Initialization takes 1-4 seconds on GeForce GTX 4

dgwsoft · April 8, 2011, 11:33am

I have some test code that creates a thread, calls cudaSetDevice(), then does some cudaMalloc()s, timing how long they take.

The first cudaMalloc() in each thread takes a long time, while subsequent calls are fast.
That is understood: the first call causes a CUDA context to be initialized, and there is
also some overhead for the first use of cudaMalloc() itself (see http://forums.nvidia.com/index.php?showtopic=158779).
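
For readers who want to reproduce this kind of measurement, here is a minimal sketch of such a test. It is not the poster's actual code: the helper names `timedMallocMs`, `worker`, and `measureAllDevices`, the 1 MiB allocation size, and the one-thread-per-device layout are all assumptions made for illustration.

```cuda
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: time a single cudaMalloc() call in milliseconds.
// The buffer is freed again outside the timed region.
static double timedMallocMs(std::size_t bytes) {
    void* p = nullptr;
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaMalloc(&p, bytes);
    auto t1 = std::chrono::steady_clock::now();
    if (err == cudaSuccess) {
        cudaFree(p);
    } else {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    }
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Hypothetical worker: select a device, then time two allocations.
// The first one is expected to include the context creation cost.
static void worker(int device, double* firstMs, double* secondMs) {
    cudaSetDevice(device);
    *firstMs = timedMallocMs(1 << 20);   // assumed size: 1 MiB
    *secondMs = timedMallocMs(1 << 20);  // context should already exist here
}

// Run one host thread per visible device and print both timings.
// Returns the number of devices measured (0 if none were found).
static int measureAllDevices() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n <= 0) {
        std::fprintf(stderr, "no usable CUDA device found\n");
        return 0;
    }
    std::vector<double> first(n, 0.0), second(n, 0.0);
    std::vector<std::thread> threads;
    for (int d = 0; d < n; ++d) {
        threads.emplace_back(worker, d, &first[d], &second[d]);
    }
    for (auto& t : threads) {
        t.join();
    }
    for (int d = 0; d < n; ++d) {
        std::printf("device %d: first cudaMalloc %.1f ms, second %.3f ms\n",
                    d, first[d], second[d]);
    }
    return n;
}
```

Calling `measureAllDevices()` from `main()` prints one line per device. Host wall-clock time is used because the cost being measured is on the host side, before any GPU work exists.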

On my ‘home’ system with 2 x GeForce GTX 465, the first cudaMalloc() takes 1-4 seconds.

On my ‘work’ system with 4 x Tesla 1060 (running the same code) it takes 70 milliseconds.

Does anyone have any idea why the 465 is so slow? Could it be some other aspect of my system making it slow?
Or is the Tesla unusually fast?

The end product may have to run on a whole range of hardware, from ‘PSCs’ to laptops with a single low-powered CUDA
device, so it may be important to know what I should expect from different devices.

avidday · April 8, 2011, 12:17pm

Are you running X11 on the GTX465 system?

dgwsoft · April 8, 2011, 12:44pm

Not on the GTX 465s. I also have a GeForce 210 that I run X on.

longlongfhl · November 22, 2012, 9:12pm

My first cudaMalloc() is also very slow. I am wondering why.

njuffa · November 22, 2012, 9:46pm

As the original poster mentioned above, the very first call to any CUDA API function triggers the creation of a CUDA context “under the hood”. A fair amount of work goes into context creation, so there will be a delay. A multi-second delay can happen under Linux when the kernel module needs to be loaded as part of the context creation process. To keep it resident, turn on persistence mode with nvidia-smi. Users encountering multi-second context creation delays despite using persistence mode should file a bug with a self-contained repro case, noting the exact platform configuration.

Often, cudaMalloc() is the first CUDA API call in a CUDA application and thus gets affected by the context creation delay. If that is inconvenient for some reason, context creation can be triggered by a call to cudaFree(0) prior to the first cudaMalloc().
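
The cudaFree(0) suggestion above could look roughly like the following sketch. The struct `WarmupTiming` and the function `warmUpThenAllocate` are hypothetical names introduced here, and the timing code is only there to show where the delay lands. Device selection is timed together with cudaFree(0) because, depending on the toolkit version, part of the initialization may already happen in cudaSetDevice().

```cuda
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical result type: time spent on initialization, time spent on the
// first real allocation afterwards, and the last CUDA status seen.
struct WarmupTiming {
    double initMs;
    double mallocMs;
    cudaError_t status;
};

// Force context creation with cudaFree(0), then time a cudaMalloc() that
// should no longer carry the one-time initialization cost.
static WarmupTiming warmUpThenAllocate(int device, std::size_t bytes) {
    using Clock = std::chrono::steady_clock;
    using Ms = std::chrono::duration<double, std::milli>;
    WarmupTiming r{0.0, 0.0, cudaSuccess};

    auto t0 = Clock::now();
    r.status = cudaSetDevice(device);
    if (r.status == cudaSuccess) {
        r.status = cudaFree(0);  // frees nothing; triggers context creation
    }
    auto t1 = Clock::now();
    r.initMs = Ms(t1 - t0).count();
    if (r.status != cudaSuccess) {
        std::fprintf(stderr, "initialization failed: %s\n",
                     cudaGetErrorString(r.status));
        return r;
    }

    void* p = nullptr;
    auto t2 = Clock::now();
    r.status = cudaMalloc(&p, bytes);
    auto t3 = Clock::now();
    r.mallocMs = Ms(t3 - t2).count();
    if (r.status == cudaSuccess) {
        cudaFree(p);
    }
    return r;
}
```

An application would typically call something like `warmUpThenAllocate(0, 1 << 20)` once at startup, for example while other setup work is in progress, so that the context creation delay does not show up in the first allocation on a latency-sensitive path. This moves the delay rather than removing it.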
