Calling cuDSS functions from multiple CPU host threads

Hi, I’m using cuDSS and have had success with it in a single-threaded CPU application. Now I want to use it in a multithreaded CPU application. Everything CUDA-related is stored in a single class, and each instance has its own cudssHandle, cudaStream, and relevant matrices, with one instance per host thread. I thought this was the correct way to set up a multithreaded application, but I’m getting read access errors in cudssExecute() that don’t occur when I run everything serially. Is this the correct way of setting things up for a multithreaded application?

Interestingly, I also have cuSPARSE and cuBLAS operations in the same class, with each instance also having its own cusparseHandle and cublasHandle, but the memory read access errors always occur in cudssExecute(), never in any cuSPARSE or cuBLAS operation. I know cuDSS is quite new, and I can’t find any information about its thread safety (whereas the docs explicitly state that cuSPARSE and cuBLAS are thread-safe in this setup with one handle per thread).
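
The setup looks roughly like this (a simplified sketch; the class name is illustrative, error checking and the cuSPARSE/cuBLAS handles are omitted):

```cpp
#include <cuda_runtime.h>
#include <cudss.h>

// One self-contained context per class instance (and one instance per host thread).
class GpuSolver {
public:
    GpuSolver() {
        cudaStreamCreate(&stream_);
        cudssCreate(&handle_);
        cudssSetStream(handle_, stream_); // bind this handle to this instance's stream
        cudssConfigCreate(&config_);
        cudssDataCreate(handle_, &data_);
    }
    ~GpuSolver() {
        cudssDataDestroy(handle_, data_);
        cudssConfigDestroy(config_);
        cudssDestroy(handle_);
        cudaStreamDestroy(stream_);
    }
    GpuSolver(const GpuSolver&) = delete;            // a context is never shared
    GpuSolver& operator=(const GpuSolver&) = delete;

private:
    cudaStream_t  stream_ = nullptr;
    cudssHandle_t handle_ = nullptr;
    cudssConfig_t config_ = nullptr;
    cudssData_t   data_   = nullptr;
};
```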

Hi!

Do you have code we could use as a reproducer? My expectation was that cuDSS would behave similarly to cuBLAS and cuSPARSE in cases like yours, where each thread has its own cuDSS objects. It might be that the failures have the same cause as in the thread “Recreating cuDSS matrix causes access violation reading location”, i.e. an error that occurs if you call cudssExecute() multiple times with the same phase.
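
For reference, the expected pattern is one call per phase against the same config/data pair, roughly like this (a sketch; handle, stream, and the matrix objects A, x, b are assumed to be created already):

```cpp
// Each phase runs once per factorization, in order, against the same
// cudssConfig_t/cudssData_t pair.
cudssExecute(handle, CUDSS_PHASE_ANALYSIS,      config, data, A, x, b);
cudssExecute(handle, CUDSS_PHASE_FACTORIZATION, config, data, A, x, b);
cudssExecute(handle, CUDSS_PHASE_SOLVE,         config, data, A, x, b);
// Execution is asynchronous on the handle's stream: the solution in x is
// only valid after the stream synchronizes.
cudaStreamSynchronize(stream);
```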

Thanks,
Kirill

Hi Kirill,

Thanks for that. I’ve got cuDSS working with multiple host threads, but I’m only seeing a slight speedup. Using the visual profiler, I’m not seeing the concurrency I expected.


I thought multiple streams would allow the kernels to run concurrently, but they seem to run pretty much sequentially. Each host thread has its own CUDA stream and cudssHandle, which I associate via cudssSetStream(). I’ve used pinned (page-locked) host memory to allow memory copies to overlap with kernel execution (although the vast majority of the memory copying in this case seems to be device-to-device, issued by cuDSS). Do you know if there’s any way to get the kernels to run concurrently, or at least to improve the concurrency somewhat?
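
The pinned-buffer path looks roughly like this (a simplified sketch; n and stream are assumed to be this thread’s problem size and CUDA stream, error checking omitted):

```cpp
#include <cuda_runtime.h>

// Copies overlap with kernels only when the host buffer is pinned
// (page-locked); a pageable buffer forces a staged, effectively blocking copy.
void uploadRhsAsync(size_t n, cudaStream_t stream) {
    double* h_rhs = nullptr;
    cudaMallocHost(&h_rhs, n * sizeof(double)); // pinned host allocation
    // ... fill h_rhs on the host ...
    double* d_rhs = nullptr;
    cudaMalloc(&d_rhs, n * sizeof(double));
    // Enqueued on this thread's own stream, so it can overlap other streams' work.
    cudaMemcpyAsync(d_rhs, h_rhs, n * sizeof(double),
                    cudaMemcpyHostToDevice, stream);
    // h_rhs must stay allocated until the copy has completed on `stream`.
}
```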

Thanks, Ben.

Hi Ben!
Not sure what’s going on with the host threads, but I’d double-check the threading code and environment. I’ve done an experiment where each user thread ran cuDSS with its own objects (I didn’t even use pinned host memory), and the profiler timeline made it clear that the solving step was executed concurrently.
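A minimal sketch of that kind of per-thread experiment (names are illustrative; each std::thread creates, uses, and destroys only its own cuDSS objects, so nothing is shared across threads):

```cpp
#include <cuda_runtime.h>
#include <cudss.h>
#include <thread>
#include <vector>

// Each worker owns every cuDSS object it touches; no state crosses thread boundaries.
void solveOnOwnStream() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudssHandle_t handle;
    cudssCreate(&handle);
    cudssSetStream(handle, stream); // all work for this handle goes to this stream

    cudssConfig_t config;
    cudssConfigCreate(&config);
    cudssData_t data;
    cudssDataCreate(handle, &data);

    // ... create this thread's A, x, b and run the analysis, factorization,
    //     and solve phases here, exactly as in the single-threaded case ...

    cudaStreamSynchronize(stream);

    cudssDataDestroy(handle, data);
    cudssConfigDestroy(config);
    cudssDestroy(handle);
    cudaStreamDestroy(stream);
}

int main() {
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back(solveOnOwnStream);
    for (auto& w : workers)
        w.join();
    return 0;
}
```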
Also note that as of cuDSS 0.2.0, the reordering (analysis) phase runs mostly on the CPU, so concurrent execution on the GPU is mostly limited to the factorization and solution phases.

Thanks, Kirill

Hi Kirill,

I’ve done some testing, and I think my setup was correct, but I wasn’t seeing kernel concurrency because the matrices I used were too large: each kernel on its own would saturate the GPU, leaving no free SMs for kernels from other streams to overlap onto. Testing the same setup with much smaller matrices, I’m able to see the concurrency I initially expected.

Thanks, Ben.

Hi Ben,

This is great news! Let us know if you see any issues with concurrency or asynchronous execution in the future.
As a note: we do plan improvements at some point that will make the execution more asynchronous. For example, by default cuDSS currently uses synchronous device memory allocation, but with the user-defined device memory allocator feature that already exists in cuDSS 0.2.0, via cudssSetDeviceMemHandler(), a user can plug in cudaMallocAsync.
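
A sketch of what that could look like (the handler struct fields here follow my reading of the cuDSS 0.2.0 docs; treat the exact field names as an assumption and verify them against your cudss.h):

```cpp
#include <cuda_runtime.h>
#include <cudss.h>
#include <cstring>

// Assumed callback shape: return 0 on success, non-zero on failure.
static int asyncAlloc(void* /*ctx*/, void** ptr, size_t size, cudaStream_t stream) {
    return cudaMallocAsync(ptr, size, stream) == cudaSuccess ? 0 : 1;
}
static int asyncFree(void* /*ctx*/, void* ptr, size_t /*size*/, cudaStream_t stream) {
    return cudaFreeAsync(ptr, stream) == cudaSuccess ? 0 : 1;
}

// Install stream-ordered allocation for cuDSS's internal device buffers.
void installAsyncAllocator(cudssHandle_t handle) {
    cudssDeviceMemHandler_t handler{}; // zero-init; field names assumed from the docs
    handler.ctx          = nullptr;
    handler.device_alloc = asyncAlloc;
    handler.device_free  = asyncFree;
    std::strncpy(handler.name, "cudaMallocAsync", sizeof(handler.name) - 1);
    cudssSetDeviceMemHandler(handle, &handler);
}
```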

Thanks, Kirill