First cudaMalloc() takes a long time?

God, I’m lost with the nuances of CUDA…

My CUDA routine is currently running 10x slower than my C++ routine.

I’ve narrowed the problem down to the first cudaMalloc(), which takes 192ms for 128k float points, whereas the next two cudaMalloc() calls for 128k cufftComplex points take 0.1ms.

I guess I’m clearly missing something obvious, but I’m currently at a loss as to what that is…

float *Td;
int size = ary_sz * sizeof(float);

CUDA_SAFE_CALL(cudaMalloc((void**)&Td, size));  // first malloc() takes 192ms for 128k float pts
cudaMemcpy(Td, timeDomain, size, cudaMemcpyHostToDevice);

// Allocate device memory for signal
cufftComplex *d_signal, *dout;
int mem_size = ary_sz * sizeof(cufftComplex);

CUDA_SAFE_CALL(cudaMalloc((void**)&d_signal, mem_size));
CUDA_SAFE_CALL(cudaMalloc((void**)&dout, mem_size));  // both of these mallocs() take 0.1ms combined

The first CUDA runtime call initializes the CUDA subsystem and takes some time to complete. Subsequent CUDA calls will take less time. This is documented in the manual.

There is, I guess, an explicit call to initialize the CUDA runtime. If you issue it first, then all your cudaMalloc() calls would take almost the same time.

btw, 2 points:

  1. Issuing cudaMalloc() in a FOR loop would take a lot of time. Instead, issue one big fat allocation and divide it according to your needs (see the sketch after these points).

  2. Some people do NOT consider the setup time (cudaMalloc, memcpy, etc.) when comparing performance. Only the time spent in the GPU kernel is measured and compared.
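For example, here is a minimal sketch of both points. The cudaFree(0) warm-up idiom, the pool layout, and all sizes here are my own illustration, not code from this thread:

// Warm up the CUDA runtime, then make one big allocation and carve it up.
#include <cuda_runtime.h>
#include <cufft.h>

int main()
{
    // Force context creation up front so later cudaMalloc() calls are cheap.
    // cudaFree(0) is a commonly used no-op that triggers lazy initialization.
    cudaSetDevice(0);
    cudaFree(0);

    const int    ary_sz        = 128 * 1024;               // example size
    const size_t float_bytes   = ary_sz * sizeof(float);
    const size_t complex_bytes = ary_sz * sizeof(cufftComplex);

    // Point 1: one big allocation instead of several small ones.
    char *pool = 0;
    cudaMalloc((void**)&pool, float_bytes + 2 * complex_bytes);

    // Carve the pool into the three buffers from the snippet above.
    float        *Td       = (float*)pool;
    cufftComplex *d_signal = (cufftComplex*)(pool + float_bytes);
    cufftComplex *dout     = (cufftComplex*)(pool + float_bytes + complex_bytes);

    // ... use Td, d_signal, dout ...
    (void)Td; (void)d_signal; (void)dout;

    cudaFree(pool);
    return 0;
}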

Good Luck

Best Regards
Sarnath

Thanks for the explanation, Sarnath. I believe I found the section in the manual:

But this is very bad news, because the entire procedure I’m seeking to optimize with CUDA takes only 20ms in C++! The CUDA startup time alone is 10x longer!

My goal is to optimize small array manipulations (1K - 128K) that consist of FFTs and other serial math functions. In my line of work, 20ms is a long time! If I could perform the array math in parallel fashion and gain a 5/10/20x speed improvement, that would be quite an achievement.

These math functions will need to be callable as part of a larger external program, so I’d need to compile them separately into our Operating System.

So, my questions for nVidia or the experts:

– Is there any way to bury this 200ms hit one time, in the program load? [In my case, the “program” is a larger, external program (customer specific) that would call these CUDA functions from a DLL]

– Is there an Init() function that could absorb the 200ms hit, or do I need a dummy malloc() call, for example? The manual seems to indicate there is no Init() function.

– Does nVidia have any plans to address/eliminate the initialization time?

– How is everyone else dealing with this init time? If I write some CUDA Photoshop Filters, for example, would every separate filter action take (200ms + execution time), or does only the first call get the 200ms hit?


You can take the init hit up front. Just have your client call a libraryInit() function or something, and put the device init in there. I do this, and it works.

You can try to call CUT_DEVICE_INIT() to see if that does this initialisation procedure, instead of a dummy malloc.

Edit: Btw - you don’t NEED to go the libraryInit() way. The first malloc you do will be the only one to have the time hit. You CAN do it the libraryInit() way to have the time hit in a specific place in the execution.

It’s only when you unload the CUDA connection and/or exit the process that the device is “disconnected”, after which you’ll have another hit when you malloc.

Stupid flow chart:

Load CUDA library
One slow malloc
Infinite fast mallocs
Unload CUDA library
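For what it’s worth, a minimal sketch of the libraryInit() idea; the function name and error handling are hypothetical, not from any real library:

// Hypothetical exported init function for a DLL: take the context-creation
// hit here, at a point the caller controls, rather than inside the first
// "real" cudaMalloc().
#include <cuda_runtime.h>

extern "C" int libraryInit(int device)
{
    if (cudaSetDevice(device) != cudaSuccess)
        return -1;

    // Touching the context forces the (slow) lazy initialization now.
    return (cudaFree(0) == cudaSuccess) ? 0 : -1;
}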

Interesting to know that the call took 20ms. That’s crazy. Lemme go check in my code… However, my stuff runs in minutes, so 20ms is acceptable to me…

btw,
Is this happening only during the first execution of your EXE file, or does it happen on every execution?

The first time you execute the EXE, I have generally found it to be slow because of page faults, cache misses (L1/L2 as well as the operating system’s disk cache), etc. Subsequent executions have been observed to be faster. Can you check along those lines?

Gaborone

Hi Sarnath,

200ms – yes, for the first cudaMalloc() in the app.

It happens every time I run the Release mode from VC 2005, not just the first time.

I’m running the program in Release mode from VC – I don’t have a standalone EXE in my Release directory, not sure which switches I need to add to the project file to make that appear…

Sorry, I don’t understand. If you don’t have an EXE, how do you run it?

I’ve been running the App straight from Visual Studio (F5 key), in Release mode.

After some more tests, running the standalone EXE yields considerably less CUDA init overhead.

----- CUDA init overhead ------

Run from VC 2005:  189ms

Run from EXE:       54ms

These init overhead times occur on every run; once the program exits, the device needs to be initialized again.

[BTW, I’m timing the first cudaMalloc() statement in the application]
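A minimal sketch of that timing, using a host-side timer around the very first CUDA call (std::chrono is used here purely for brevity; it postdates VC 2005):

// Sketch: timing the first cudaMalloc() with a host-side (wall-clock) timer.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main()
{
    float *Td = 0;
    const size_t size = 128 * 1024 * sizeof(float);

    auto t0 = std::chrono::steady_clock::now();
    cudaMalloc((void**)&Td, size);   // first CUDA call: pays for context creation
    auto t1 = std::chrono::steady_clock::now();

    printf("first cudaMalloc: %.1f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());

    cudaFree(Td);
    return 0;
}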

If 54 ms is significant relative to the running time of your complete application, then maybe you are trying to speed up something that is not worth speeding up?
The initialization overhead of starting an EXE is maybe already higher than that.

I agree. The OS has to read the EXE from disk, read shared libraries, link shared-library function calls, map the program’s memory space into virtual memory, and more.

Hi guys,

Well, 54ms is small compared to the runtime of my whole application. My whole (external) application could take 3s, for example.

However, a “long” calculation inside my application could take up to 20ms (yes, 20ms is a long time in my world) – and that’s what I’m testing now with CUDA: how effective CUDA is at speeding up a “one-shot” calculation that takes 20ms serially with a Core2Duo. [When I say “one-shot”, I mean the data entering that calculation is different every 3s, so the calculation must happen on its own – i.e. I can’t bundle together the 20ms calculations]

The calculation I’m currently evaluating consists of:
– a 64K 1D float array
– 2 FFTs (R2C/C2R)
– an array swap of the complex data (swap [0] & [1] dimensions)
– and a moving average filter

The array swap and moving average filter could theoretically be done entirely in parallel, if there were no hardware constraints, or at least heavily parallelized in a block/thread environment.
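For what it’s worth, a rough sketch of that pipeline with cuFFT; the spectrum-processing kernel below is only a placeholder for the swap, the moving-average step is omitted, and all sizes and names are assumptions:

// Rough sketch of the described calculation on a 64K real signal.
#include <cuda_runtime.h>
#include <cufft.h>

__global__ void process_spectrum(cufftComplex *freq, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // placeholder: swap real/imag parts (stands in for the array swap)
        float tmp = freq[i].x;
        freq[i].x = freq[i].y;
        freq[i].y = tmp;
    }
}

void run_pipeline(const float *h_in, float *h_out)
{
    const int N  = 64 * 1024;       // 64K real samples
    const int NF = N / 2 + 1;       // size of the R2C output

    float *d_sig = 0;
    cufftComplex *d_freq = 0;
    cudaMalloc((void**)&d_sig,  N  * sizeof(float));
    cudaMalloc((void**)&d_freq, NF * sizeof(cufftComplex));

    cudaMemcpy(d_sig, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    cufftHandle plan_r2c, plan_c2r;
    cufftPlan1d(&plan_r2c, N, CUFFT_R2C, 1);
    cufftPlan1d(&plan_c2r, N, CUFFT_C2R, 1);

    cufftExecR2C(plan_r2c, d_sig, d_freq);                  // forward FFT
    process_spectrum<<<(NF + 255) / 256, 256>>>(d_freq, NF);
    cufftExecC2R(plan_c2r, d_freq, d_sig);                  // inverse FFT (cuFFT leaves it unscaled by N)

    // A moving-average filter kernel would go here; omitted for brevity.

    cudaMemcpy(h_out, d_sig, N * sizeof(float), cudaMemcpyDeviceToHost);

    cufftDestroy(plan_r2c);
    cufftDestroy(plan_c2r);
    cudaFree(d_sig);
    cudaFree(d_freq);
}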

My external application would continue running in 3s chunks – so perhaps the CUDA initialization hit would only happen once, on the very first run(?) If so, that removes one hurdle.

However, if CUDA is not suited to speeding up a one-shot, serial 20ms calculation because of memory/kernel overhead, then it doesn’t help me in my case (and that frees me to buy Mass Effect and use my graphics card for something else) :)

Any thoughts?

It should be OK.

(In my mind, the only chance of it not working is if you unload any DLLs or “kill” the CUDA context every 3 seconds.)

Just got around to playing with my first CUDA program (following the typical hello world example :) ) after years of wondering if/when/how to jump on the GPU bandwagon.

Yes, I’m seeing 120ms of “startup” overhead. The simple 1-million-element array-add program took 24ms on the CPU and 300+ms on the GPU (V100), with 260ms spent in cudaMalloc().

Obviously I need to find a real-world problem to solve in order to “bury” the startup overhead. Even so, 200ms is a lot of time and should not be lightly ignored.

Is cudaMalloc() blocking? Ideally, I’d call it ahead of time and run through whatever preparation workload there is on the CPU before reaching the heavy number-crunching code.

Since CUDA uses lazy context initialization and cudaMalloc() is often the first CUDA API call in a program, the first call to cudaMalloc() incurs all the overhead of context creation, much of which has to do with generating a unified virtual memory map of the entire system memory and all GPU memory. You can trigger CUDA context creation at a point you prefer by calling cudaFree(0).

As a very rough rule of thumb for CUDA context creation, assume initialization cost of 4ms per GB mapped on a high-speed system. This is pretty much all single-threaded operating system activity, so it should run faster on systems with high single-thread CPU performance. I would recommend a CPU with base clock >= 3.5 GHz.

Since a program cannot safely use memory before it is allocated, memory allocation is by its nature a synchronous activity: control returns to the calling program once the allocation has been completed. There is a non-trivial cost to allocating and de-allocating GPU memory, so it is a good practice to minimize these activities in a CUDA-based application, for example by re-using previously made allocations.
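A small sketch of both points, i.e. triggering context creation early with cudaFree(0) and reusing one device allocation across repeated calls; the function names are illustrative:

#include <cuda_runtime.h>
#include <cstddef>

static float *g_buffer  = 0;        // persistent device buffer, reused across calls
static size_t g_capacity = 0;

// Call once at startup: absorbs the context-creation cost at a known point.
void warm_up_cuda()
{
    cudaSetDevice(0);
    cudaFree(0);                    // no-op that forces lazy context initialization
}

// Repeated work: grow the buffer only when needed instead of
// allocating and freeing device memory on every invocation.
float *get_device_buffer(size_t bytes)
{
    if (bytes > g_capacity) {
        cudaFree(g_buffer);         // freeing a null pointer is a no-op
        cudaMalloc((void**)&g_buffer, bytes);
        g_capacity = bytes;
    }
    return g_buffer;
}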