I have a program which needs to create a large number of cuBLAS handles (~500-1000), each assigned an independent stream via cublasSetStream() and an independent workspace via cublasSetWorkspace() so they can perform work asynchronously. While profiling this program I noticed it was using far more memory than I expected. After a little investigation I realized this seems to happen because each cuBLAS handle has a default workspace which is NOT freed when I assign it a manually allocated one.
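For reference, this is roughly what my per-handle setup looks like (error checking omitted; `HandleSlot` and `WORKSPACE_BYTES` are my own names, not part of cuBLAS):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Size of the workspace I allocate myself per handle (my own choice).
constexpr size_t WORKSPACE_BYTES = 4 * 1024 * 1024;

struct HandleSlot {
    cublasHandle_t handle;
    cudaStream_t   stream;
    void          *workspace;
};

HandleSlot makeSlot() {
    HandleSlot s{};
    cublasCreate(&s.handle);        // the library also keeps its own default workspace
    cudaStreamCreate(&s.stream);
    cublasSetStream(s.handle, s.stream);
    cudaMalloc(&s.workspace, WORKSPACE_BYTES);
    cublasSetWorkspace(s.handle, s.workspace, WORKSPACE_BYTES);
    return s;
}
```

Even after `cublasSetWorkspace()` succeeds, the memory footprint suggests the default workspace is still resident.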
This seems to differ from the cufftSetWorkArea() function for the cuFFT library whose documentation states:
“If the work area was auto-allocated, cuFFT frees the auto-allocated space.”
There is no such statement in the cuBLAS documentation for cublasSetWorkspace(). Perhaps this is necessary because cublasSetStream() “unconditionally resets the cuBLAS library workspace back to the default workspace pool”, so I guess the default workspace still needs to exist in case the stream is reset and a new workspace is not assigned? This also differs from the cuFFT library, whose set-stream function does not reset an assigned work area.
So my real question here is: is there any way to free the auto-allocated workspace for a cuBLAS handle? I will not need to change the stream again once it is set, and the auto-allocated workspaces take up a significant amount of memory (8 MB per handle x 500-1000 handles). If not, is there any way to control the size of the auto-allocated workspace to make it smaller?
Why do you need 500-1000 separate handles? Also, the hardware can only support 32 simultaneous streams, and at best you might have 4 running in parallel depending on available resources. Is all of this running in a single CPU process? A handle manages all resources needed by the cuBLAS library, and all operations and streams can use a single handle.
What operations from cuBLAS are you using?
@mnicely thanks for the reply. So it is news to me that you can only run 32 simultaneous streams. That certainly invalidates the need for 500+ separate handles. Nonetheless, I think the question is still important as even with just 32 handles I wouldn’t want to unnecessarily waste 256 MB of memory.
I am calling cuBLAS functions for each handle from different CPU threads. Operations I’m calling include cublasCgemm, cublasCgeam, luCublasComp, invCublasComp etc.
So it sounds like you are suggesting that I really only need 1 handle. I could still have 500, or 32, or however many separate workspaces for asynchronous processing; I would just need to set the stream and workspace for the handle before each call. This would result in only 1 default workspace wasting a negligible 8 MB of memory. The downside to that solution is that I would then need to make all cuBLAS calls from the same CPU thread to prevent race conditions with the stream and workspace setting. With that consideration I would still prefer to use separate handles if possible.
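To make sure I understand the suggestion, the single-handle pattern would look something like this sketch (identifiers like `runJobs` are mine, error checks omitted; all calls come from one CPU thread so the set-stream/set-workspace pairs cannot race):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cuComplex.h>
#include <vector>

// One handle, several streams + workspaces: re-point the handle at stream s
// and workspace s just before each launch. Note cublasSetStream() resets the
// workspace to the default pool, so cublasSetWorkspace() must follow it.
void runJobs(cublasHandle_t handle,
             const std::vector<cudaStream_t>& streams,
             const std::vector<void*>& workspaces, size_t wsBytes,
             int n, cuComplex* A, cuComplex* B, cuComplex* C, int numJobs) {
    const cuComplex one  = make_cuComplex(1.f, 0.f);
    const cuComplex zero = make_cuComplex(0.f, 0.f);
    for (int i = 0; i < numJobs; ++i) {
        size_t s = i % streams.size();
        cublasSetStream(handle, streams[s]);
        cublasSetWorkspace(handle, workspaces[s], wsBytes);
        cublasCgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &one,  A + i * n * n, n,
                           B + i * n * n, n,
                    &zero, C + i * n * n, n);
    }
}
```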
So I am still wondering if there is a way to reduce the size of, or free the default workspace for a cuBLAS handle.
Regarding reducing the default workspace in cuBLAS, I don’t think so, as I believe cuBLAS is built on the CUDA runtime, whereas cuFFT is built on the CUDA driver APIs. I’ll have to confirm once everyone gets back from the holidays.
I still don’t understand what you are hoping to accomplish. I’m assuming you don’t have 500-1000 different unique workflows. Therefore, my guess is that you’re trying to do a bunch of identical workflows in parallel? Maybe I’m wrong because you also mentioned race conditions.
Do you have a simple working example without multiple handles and streams? You also mentioned using 32 streams again. Our math libs are highly optimized and may use numerous resources per function. I doubt you’ll have more than four streams running in parallel unless you have very small datasets, but that’s a whole different issue.
The GPU is running asynchronously. Say you have 1000 workflows you want to run. Start with two streams. Everything in a stream executes “in-order”. Use a round robin approach to launch 500 workflows in each stream. Let the hardware handle all the parallelism. You can use stream or device syncs to avoid race conditions. Once you have that working, use Nsight Systems to see if you have any available resources. If yes, increase the number of streams.
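Something like this skeleton, roughly (the async memset is just a placeholder for your actual workflow’s kernels/cuBLAS calls):

```cpp
#include <cuda_runtime.h>

int main() {
    const int    kNumWorkflows = 1000;
    const size_t kBufBytes     = 1 << 20;
    cudaStream_t streams[2];
    void*        buf[2];
    for (int s = 0; s < 2; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], kBufBytes);
    }
    // Everything issued into one stream executes in order, so workflow i+2
    // can safely reuse workflow i's buffer without an explicit sync.
    for (int i = 0; i < kNumWorkflows; ++i) {
        int s = i % 2;
        cudaMemsetAsync(buf[s], 0, kBufBytes, streams[s]);  // placeholder work
    }
    cudaDeviceSynchronize();  // device-wide sync before touching results
    for (int s = 0; s < 2; ++s) {
        cudaFree(buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```

Once that’s working, profile it with Nsight Systems and bump the stream count only if you see idle resources.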
Remember, separate CPU processes to the same GPU means separate CUDA contexts, which means context switching unless you’re running MPS.
@mnicely I was not aware of the concept of context switching. Does this cause significant overhead?
Your guess was correct: I have identical workflows I am trying to do in parallel, each one currently being processed on its own CPU thread. I guess there is no reason I couldn’t do all this work on the same thread and at each step in the processing just fire off 500 cuBLAS calls on their respective streams (however many I decide to use), all using the same cuBLAS handle. This seems like a better solution which solves my memory issue and may also provide a performance benefit as the context switching is eliminated.