I have a C# exe that loads a non-CUDA C++ dll on demand (actually, while working on this project I’ve changed it to always load it at startup). That dll demand-loads another dll that contains all the CUDA code.
The cuda code is working great with no open issues, except that the 150-200 ms “first-call overhead” is incurred every time I run the method that I’m using for testing/debugging the project.
Specifically, the C# test method currently calls about 6 different kernels multiple times for a total of about 50 kernel calls. The 200 ms overhead happens only on the first kernel call (or when cudaSetDevice is called, if I choose to do it that way). After that everything is good for the duration of the method (well, sometimes the second call and occasionally the third are a bit slow also, but from what I have read about this subject I think this is expected). The next time the method is called I incur the overhead again.
This amount of overhead makes the whole effort worthless, since the user will only be calling one 50-100 ms kernel at a time, and with that overhead included I can do much better with CPU cores + SIMD.
I’m thinking that either I’m missing a really easy fix, or else I’m in big trouble and the project needs to be restructured, which there are good reasons to avoid. Can anyone help me out with this?
I looked at MPS briefly and decided it was overkill and probably not applicable, but if someone posts that it might be a solution then I will jump into it.
I’m going to try the CUDA_MODULE_LOADING=LAZY thing later this morning, thanks for the suggestion.
LAZY means the slowdown happens at first invocation time, separately for each function (I think this is the new default); EAGER means the slowdown happens at module load time (I think the old default).
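If you want to verify which mode actually took effect at runtime, I believe the driver API has had a query for this since around CUDA 11.7 (cuModuleGetLoadingMode); treat this as a rough, untested sketch:

```cpp
// Rough sketch: ask the driver which module loading mode it actually picked
// up. Needs cuda.h and linking against the driver API (cuda.lib) in addition
// to the runtime library.
#include <cuda.h>
#include <cuda_runtime.h>
#include <cstdio>

void PrintModuleLoadingMode()
{
    cudaFree(nullptr);  // any runtime call, just to make sure the driver is initialized
    CUmoduleLoadingMode mode;
    if (cuModuleGetLoadingMode(&mode) == CUDA_SUCCESS)
        std::printf("module loading mode: %s\n",
                    mode == CU_MODULE_LAZY_LOADING ? "LAZY" : "EAGER");
}
```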
You could start a thread that handles CUDA initialization at DLL load time. It initializes everything in the background so that CUDA is ready by the time of the first function invocation.
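Roughly like this (just a sketch; StartCudaWarmup / WaitForCudaWarmup are made-up names, and it assumes you can export an init function from the non-CUDA dll and have the C# side call it once at startup rather than doing work in DllMain):

```cpp
// Rough sketch: kick off CUDA context creation on a background thread as soon
// as the host asks for it, so the first real kernel call doesn't pay the
// ~200 ms startup cost.
#include <cuda_runtime.h>
#include <future>

static std::future<cudaError_t> g_cudaWarmup;

// Exported init function the C# side calls once at startup.
extern "C" __declspec(dllexport) void StartCudaWarmup()
{
    g_cudaWarmup = std::async(std::launch::async, [] {
        cudaError_t err = cudaSetDevice(0);  // choose the device
        if (err != cudaSuccess)
            return err;
        return cudaFree(nullptr);            // forces full context creation
    });
}

// Call this once before the first kernel launch so a warmup failure doesn't
// go unnoticed; it blocks only if the warmup hasn't finished yet.
extern "C" __declspec(dllexport) int WaitForCudaWarmup()
{
    return static_cast<int>(g_cudaWarmup.get());
}
```

With the runtime API the primary context is shared process-wide, so a context created on the warmup thread should also cover later calls made from other threads.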
I wouldn’t need to offload to a thread if pre-handling worked … if I only needed to do it once then for 200 ms I could do it anywhere/anytime. But if you mean something like cudaSetDevice(), it doesn’t work. If I call cudaSetDevice() inside the C# method then it does its job (i.e. little or no overhead in the kernel calls), but I have to do it every time. If I call cudaSetDevice() outside the C# method then the call is successful but it doesn’t help the problem - the overhead is still there each time the method runs.
I’m calling two different batches of kernels depending on which of 2 branches I’m testing/debugging. Both branches end up calling 5 or 6 kernels. All the kernels are similar, with a bunch of cudaMalloc() calls among other things.
I have not jumped into Nsight yet because it seems pretty clear what is happening - CUDA is “forgetting” that the initialization is done and just doing it all over again. Also I’m a bit reluctant to jump into Nsight because I’m afraid it will be a pain to crank up for a case like this with C# → non-CUDA dll → CUDA dll. But I will do it if I have some idea what I’m looking for.
Ok I tried both CUDA_MODULE_LOADING=LAZY and CUDA_MODULE_LOADING=EAGER. No noticeable difference between those two, or compared with having no CUDA_MODULE_LOADING environment variable at all. I’m assuming that I don’t need to reference the variable in my project, it’s just something that CUDA grabs on its own?
Can you try to find out which calls take up the time? E.g. with cudaDeviceSynchronize() in between, to account for asynchronous functions.
cudaMalloc() can be slow and should typically be done only at the beginning of the program (if possible).
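For example, something along these lines (only a sketch; TimeStep and the buffer names/sizes are placeholders, not your real code):

```cpp
// Rough sketch: bracket each suspect call with cudaDeviceSynchronize() and a
// host timer, so asynchronous work gets charged to the step that launched it.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

template <typename F>
void TimeStep(const char* label, F&& step)
{
    cudaDeviceSynchronize();                 // drain anything already queued
    auto t0 = std::chrono::steady_clock::now();
    step();
    cudaDeviceSynchronize();                 // wait for this step to really finish
    auto t1 = std::chrono::steady_clock::now();
    std::printf("%-16s %8.3f ms\n", label,
                std::chrono::duration<double, std::milli>(t1 - t0).count());
}

// Example usage inside the CUDA dll (names and sizes are made up):
void RunWithTimings(const float* hostData, size_t n)
{
    float* devA = nullptr;
    TimeStep("setDevice",    [&] { cudaSetDevice(0); });
    TimeStep("large malloc", [&] { cudaMalloc(&devA, n * sizeof(float)); });
    TimeStep("large memcpy", [&] { cudaMemcpy(devA, hostData, n * sizeof(float),
                                              cudaMemcpyHostToDevice); });
    // ... same pattern around each kernel launch and the later memcpys ...
    TimeStep("free",         [&] { cudaFree(devA); });
}
```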
That turns out to be an excellent suggestion. This is much more “interesting” than I was thinking. In the CUDA code I’m first doing several smaller cudaMalloc() and cudaMemcpy() calls … exiting after those, there is no overhead. Then I’m doing 4 much larger cudaMalloc() calls and 2 larger cudaMemcpy() calls … exiting after those, the 1st call has no overhead but the 2nd call has all the overhead. Then I’m doing some non-CUDA work and two more bigger cudaMemcpy() calls … exiting after those, sometimes the 2nd call has all the overhead and sometimes the 1st and 2nd calls split the overhead. After that the kernel is called, and exiting immediately after the kernel finishes, it goes back to what I’m accustomed to seeing, with all or almost all the overhead always in the 1st call.
Sorry, I did all that forgetting about the cudaDeviceSynchronize() calls … will try that next.
Ok I switched to a simpler branch with fewer memory-related lines, and added the cudaDeviceSynchronize() calls before exiting. The results were very similar to the above with the 1st and 2nd calls splitting the overhead most of the time, until the kernel is called after which it’s all in the first call. Does this suggest any conclusions?
Sorry for not being clear. “Exiting” means doing an early return of the function doing all the cuda work (i.e. a quick-and-dirty way to investigate this issue without setting up for proper profiling). So I’m just sticking in different early returns and recompiling. To get the reported results I’m running the basic function call 5 times and printing the timer results.
I think maybe I just need to buckle down and get Nsight going if possible. I will jump on that this morning. Since you seem to be very knowledgeable I’m assuming there are no “easy and obvious” suggestions to be made. I’m new with this cuda stuff so I was hoping there would be a solution based on something everyone but me found obvious and elementary.
I can’t do that as part of any solution because the buffer sizes are highly variable. To cover all expected parameter ranges I would need to allocate 1000x as much memory as the most typical case.