First cudaMalloc() takes a long time?

God, I’m lost with the nuances of CUDA…

My CUDA routine is currently running 10x slower than my C++ routine.

I’ve narrowed the problem down to the first cudaMalloc(), which takes 192ms for 128k float points, whereas the next two cudaMalloc() calls for 128k cufftComplex points take 0.1ms.

I guess I’m clearly missing something obvious, but I’m currently at a loss to what that is…

    float *Td;
    int size = ary_sz * sizeof(float);

    CUDA_SAFE_CALL(cudaMalloc((void**)&Td, size));   // first cudaMalloc() takes 192ms for 128k float pts
    cudaMemcpy(Td, timeDomain, size, cudaMemcpyHostToDevice);

    // Allocate device memory for signal
    cufftComplex *d_signal, *dout;
    int mem_size = ary_sz * sizeof(cufftComplex);

    CUDA_SAFE_CALL(cudaMalloc((void**)&d_signal, mem_size));
    CUDA_SAFE_CALL(cudaMalloc((void**)&dout, mem_size));    // both of these mallocs take 0.1ms combined

The first CUDA runtime call initializes the CUDA subsystem and takes some time to complete. Subsequent CUDA calls take less time. This is documented in the manual.

There’s an explicit call, I guess, to initialize the CUDA runtime. If you issue it first, then all your cudaMalloc() calls would take almost the same time.
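Something like this, for example (a minimal sketch: cudaFree(0) is just a cheap, harmless runtime call used here to force the one-time context creation up front, and the buffer size is only illustrative):

    #include <cuda_runtime.h>
    #include <stdio.h>

    // Force the CUDA runtime to create its context at a point we choose.
    // Any runtime call would do; cudaFree(0) is a harmless one.
    void warmUpCuda(void)
    {
        cudaError_t err = cudaFree(0);      // pays the one-time init cost here
        if (err != cudaSuccess)
            fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
    }

    int main(void)
    {
        warmUpCuda();                       // slow (one-time context creation)

        float *Td = 0;
        cudaMalloc((void**)&Td, 128 * 1024 * sizeof(float));   // now fast
        cudaFree(Td);
        return 0;
    }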

btw, 2 points:

  1. Issuing cudaMalloc() in a FOR loop takes a lot of time. Instead, issue one big fat allocation and divide it up according to your needs (see the sketch after this list).

  2. Some people do NOT consider the setup time (cudaMalloc, cudaMemcpy, etc.) when comparing performance; only the time spent in the GPU kernel is measured and compared.
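For point 1, a minimal sketch of the “one big fat allocation” idea, assuming for illustration the three buffers from the original code (Td, d_signal, dout) of ary_sz elements each; the helper name and the 256-byte padding are mine, not anything CUDA requires:

    #include <cuda_runtime.h>
    #include <cufft.h>

    // One cudaMalloc() for everything, then carve it into sub-buffers.
    // Sizes are padded to 256 bytes so each sub-buffer stays aligned.
    static size_t align256(size_t n) { return (n + 255) & ~(size_t)255; }

    void allocateBuffers(int ary_sz, void **pool,
                         float **Td, cufftComplex **d_signal, cufftComplex **dout)
    {
        size_t szTd  = align256(ary_sz * sizeof(float));
        size_t szSig = align256(ary_sz * sizeof(cufftComplex));

        cudaMalloc(pool, szTd + 2 * szSig);          // one big fat allocation

        char *base = (char*)(*pool);
        *Td       = (float*)base;
        *d_signal = (cufftComplex*)(base + szTd);
        *dout     = (cufftComplex*)(base + szTd + szSig);
    }

    // Later, a single cudaFree(pool) releases all three sub-buffers at once.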

Good Luck

Best Regards
Sarnath

Thanks for the explanation, Sarnath. I believe I found the relevant section in the manual.

But this is very bad news!

The entire procedure I’m seeking to optimize with CUDA takes only 20ms in C++, while the CUDA startup time alone is 10x longer!

My goal is to optimize small array manipulations (1K - 128K points) that consist of FFTs and other serial math functions. In my line of work, 20ms is a long time! If I could perform the array math in parallel fashion and gain a 5/10/20x speed improvement, that would be quite an achievement.

These math functions will need to be callable as part of a larger external program, so I’d need to compile them separately into our Operating System.

So, my questions for nVidia or the experts:

– Is there any way to take this 200ms hit just once, at program load? [In my case, the “program” is a larger, external (customer-specific) program that would call these CUDA functions from a DLL.]

– Is there an Init() function that could absorb the 200ms hit, or do I need a dummy malloc() call, for example? The manual seems to indicate there is no Init() function.

– Does nVidia have any plans to address/eliminate the initialization time?

– How is everyone else dealing with this init time? If I write some CUDA Photoshop Filters, for example, would every separate filter action take (200ms + execution time), or does only the first call get the 200ms hit?

You can take the init hit up front. Just have your client call a libraryInit() function or something, and put the device init in there. I do this, and it works.

You can try calling CUT_DEVICE_INIT() to see if that performs this initialisation, instead of using a dummy malloc.

Edit: BTW, you don’t NEED to go the libraryInit() way. The first malloc you do will be the only one to take the time hit. You CAN do it the libraryInit() way to put the time hit in a specific place in the execution.

It’s only when you unload the CUDA connection and/or exit the process that the device is “disconnected”, after which you’ll have another hit when you malloc.

Stupid flow chart:

  1. Load CUDA library
  2. One slow malloc
  3. Infinite fast mallocs
  4. Unload CUDA library
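For the DLL case mentioned above, a minimal sketch of what that libraryInit() might look like; the function name, the __declspec(dllexport) wrapper, and the cudaFree(0) warm-up are assumptions of mine (CUT_DEVICE_INIT() from the SDK’s cutil would be another way to do the same thing):

    #include <cuda_runtime.h>

    // Exported init entry point, so the client decides when to pay the
    // one-time context-creation cost.
    extern "C" __declspec(dllexport) int libraryInit(int device)
    {
        if (cudaSetDevice(device) != cudaSuccess)    // pick the GPU
            return -1;
        if (cudaFree(0) != cudaSuccess)              // forces context creation (the ~200ms hit)
            return -1;
        return 0;                                    // later cudaMalloc() calls are fast
    }

    // Client side, called once when the host application loads the DLL:
    //     if (libraryInit(0) != 0) { /* fall back to the CPU path */ }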

Interesting to know that the call took 20ms. That’s crazy. Lemme go check in my code… However, my stuff runs in minutes, so 20ms is acceptable to me…

btw,
Is this happening only during the first execution of your EXE file, or does it happen on every execution?

The first time you execute the EXE, I have found it (in general) to be slow because of page faults, cache misses (both L1/L2 and the operating system’s disk cache), etc. Subsequent executions have been observed to be faster. Can you check along those lines?

Gaborone

Hi Sarnath,

200ms – yes, for the first cudaMalloc() in the app.

It happens every time I run the Release mode from VC 2005, not just the first time.

I’m running the program in Release mode from VC. I don’t have a standalone EXE in my Release directory, and I’m not sure which switches I need to add to the project file to make that appear…

Sorry, I don’t understand. If you don’t have an EXE, how do you run it?

I’ve been running the App straight from Visual Studio (F5 key), in Release mode.

After some more tests, running the EXE yields noticeably less CUDA init overhead.

----- CUDA init overhead ------

Run from VC 2005:  189ms

Run from EXE:       54ms

These init overhead times occur on every run, so once the program exits, the device needs to be initialized again.

[BTW, I’m timing the first cudaMalloc() statement in the application.]
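For reference, a minimal sketch of how that first call can be timed on the host side with the Windows high-resolution counter (this is a VC 2005 build, so QueryPerformanceCounter is available); the exact timing code wasn’t posted, so this is only an illustration:

    #include <windows.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    // Wall-clock timing of the first cudaMalloc(), which includes the
    // one-time context creation.
    int main(void)
    {
        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);

        float *Td = 0;
        QueryPerformanceCounter(&t0);
        cudaMalloc((void**)&Td, 128 * 1024 * sizeof(float));
        QueryPerformanceCounter(&t1);

        double ms = 1000.0 * (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
        printf("first cudaMalloc(): %.1f ms\n", ms);

        cudaFree(Td);
        return 0;
    }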

If 54ms is significant relative to the running time of your complete application, then maybe you are trying to speed up something that is not worth speeding up?
The initialization overhead of starting an EXE is maybe already higher than that.

I agree. The OS has to read the EXE from disk, load shared libraries, link shared-library function calls, map the program’s memory space into virtual memory, and more.

Hi guys,

Well, 54ms is not typical for the runtime of my whole application. My whole (external) application could take 3s, for example.

However, a “long” calculation inside my application could take up to 20ms (yes, 20ms is a long time in my world), and that’s what I’m testing now with CUDA: how effective CUDA is at speeding up a “one-shot” calculation that takes 20ms serially on a Core2Duo. [When I say “one-shot”, I mean the data entering that calculation is different every 3s, so the calculation must happen on its own; i.e., I can’t bundle the 20ms calculations together.]

The calculation I’m currently evaluating consists of:
– a 64K 1D float array
– 2 FFTs (R2C/C2R)
– an array swap of the complex data (swap [0] & [1] dimensions)
– and a moving average filter

The array swap and moving-average filter could theoretically be done entirely in parallel, if there were no hardware constraints, or at least heavily parallelized in a block/thread environment (a rough sketch follows below).
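For concreteness, a rough sketch of what that pipeline might look like with cuFFT and two small kernels. The swap-real-and-imaginary reading of the “[0] & [1] dimensions” swap, the 16-sample window, and the omitted scaling of the unnormalized C2R output are all assumptions/simplifications of mine, not details from the thread:

    #include <cuda_runtime.h>
    #include <cufft.h>

    #define N (64 * 1024)          // 64K-point 1D float array

    // Swap the two components of each complex sample (one interpretation of
    // "swap the [0] & [1] dimensions"); purely element-wise, so fully parallel.
    __global__ void swapComponents(cufftComplex *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float tmp = d[i].x;
            d[i].x = d[i].y;
            d[i].y = tmp;
        }
    }

    // Naive moving-average filter with an (assumed) window of W samples.
    __global__ void movingAverage(const float *in, float *out, int n, int W)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float sum = 0.0f;
        for (int k = 0; k < W; ++k) {
            int j = i - k;
            if (j >= 0) sum += in[j];
        }
        out[i] = sum / W;
    }

    void processSignal(float *d_time, float *d_result)
    {
        int nFreq = N / 2 + 1;                 // R2C output length
        cufftComplex *d_freq;
        cudaMalloc((void**)&d_freq, nFreq * sizeof(cufftComplex));

        cufftHandle fwd, inv;
        cufftPlan1d(&fwd, N, CUFFT_R2C, 1);
        cufftPlan1d(&inv, N, CUFFT_C2R, 1);

        cufftExecR2C(fwd, (cufftReal*)d_time, d_freq);              // forward FFT

        int threads = 256;
        swapComponents<<<(nFreq + threads - 1) / threads, threads>>>(d_freq, nFreq);

        cufftExecC2R(inv, d_freq, (cufftReal*)d_time);              // inverse FFT (unscaled)

        movingAverage<<<(N + threads - 1) / threads, threads>>>(d_time, d_result, N, 16);

        cufftDestroy(fwd);
        cufftDestroy(inv);
        cudaFree(d_freq);
    }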

My external application would continue running in 3s chunks – so perhaps the CUDA initialization hit would only happen once, on the very first run(?) If so, that removes one hurdle.

However, if CUDA is not suited to speeding up a one-shot, serial 20ms calculation because of memory/kernel overhead, then it doesn’t help me in my case (and that frees me to buy Mass Effect and use my graphics card for something else) :)

Any thoughts?

It should be OK.

(In my mind, the only chance of it not working is if you unload any DLLs or “kill” the CUDA context every 3 seconds.)