multi-threading with cusparse lib

I’m working on an application that creates a couple of CPU threads to do some work with a shared cusparse handle. I’d like every thread to use a cudaStream assigned to it.
I know there is cusparseSetStream() which sets a stream to a handle.
My worry is that there can be a race condition between CPU threads and everything gets messed up.
For now, I have a mutex object associated with the shared cusparse handle.
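Something like this is what I currently have in mind (a minimal sketch only, assuming C++11 threading; workerBody and the per-thread data are placeholders, not my real code):

#include <cusparse.h>
#include <cuda_runtime.h>
#include <mutex>
#include <thread>
#include <vector>

std::mutex handleMutex;   // guards the shared cusparse handle

void workerBody(cusparseHandle_t handle, cudaStream_t stream)
{
	// Hold the lock from "set stream" until the library calls have been issued,
	// so another thread cannot re-point the handle to a different stream mid-call.
	std::lock_guard<std::mutex> lock(handleMutex);
	cusparseSetStream(handle, stream);
	// ... cusparse* calls issued here run in 'stream' ...
}

int main()
{
	cusparseHandle_t handle;
	cusparseCreate(&handle);

	const int nThreads = 4;
	std::vector<cudaStream_t> streams(nThreads);
	std::vector<std::thread>  threads;
	for (int i = 0; i < nThreads; ++i)
		cudaStreamCreate(&streams[i]);

	for (int i = 0; i < nThreads; ++i)
		threads.emplace_back(workerBody, handle, streams[i]);
	for (auto& t : threads)
		t.join();

	for (int i = 0; i < nThreads; ++i)
		cudaStreamDestroy(streams[i]);
	cusparseDestroy(handle);
	return 0;
}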

My question is, does cusparseSetStream() work in a similar way to cudaSetDevice() with respect to internally associating threads with streams?
Either way, is it the same in cublas and cufft?

Oh, and yes, I do know I can have a handle per thread, but that’s not an answer to my problem.
In my experience, creating multiple handles is quite memory consuming.

Best wishes,

I checked with the team in charge of CUDA’s linear algebra libraries. The CUSPARSE and CUBLAS libraries are thread safe up to a point, but not in the sense you described above:

  1. The CUSPARSE and CUBLAS libraries are thread safe from the point of view of textures. They use locks to make “texture bind / kernel / texture unbind” an atomic section.

  2. The CUSPARSE and CUBLAS libraries are not thread safe from the point of view of streams. If the user is using multiple streams and does not want to create multiple handles (one for each thread), the solution is to use their own lock.

Can you explain why using multiple contexts is unacceptable in your case? As you note, each context requires a fair amount of memory, but unless many contexts are created that should be tolerable? One design alternative would be to use a “worker thread” approach where only a single thread creates a GPU context and all other threads communicate with that worker thread.
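For illustration, a rough sketch of that worker-thread approach (the class and queue names here are made up, not from any NVIDIA library; shown with C++11 threading):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class GpuWorker {
public:
	GpuWorker() : stop_(false), thread_(&GpuWorker::run, this) {}
	~GpuWorker() {
		{ std::lock_guard<std::mutex> l(m_); stop_ = true; }
		cv_.notify_one();
		thread_.join();
	}
	// Any CPU thread may call this; the task itself runs on the GPU-owning thread.
	void submit(std::function<void()> task) {
		{ std::lock_guard<std::mutex> l(m_); tasks_.push(task); }
		cv_.notify_one();
	}
private:
	void run() {
		// This single thread would create the cusparse/cublas/cufft handles once
		// and reuse them for every submitted task.
		for (;;) {
			std::function<void()> task;
			{
				std::unique_lock<std::mutex> l(m_);
				cv_.wait(l, [this] { return stop_ || !tasks_.empty(); });
				if (tasks_.empty()) return;   // stopped and drained
				task = tasks_.front();
				tasks_.pop();
			}
			task();   // issue cusparse*/cublas* calls, memcpies, etc. here
		}
	}
	std::mutex m_;
	std::condition_variable cv_;
	std::queue<std::function<void()> > tasks_;
	bool stop_;
	std::thread thread_;
};

With this arrangement all allocation, deallocation, and library calls happen in one thread, so no lock around the handles is needed.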

Hi njuffa, thank you for your answer.
Would you mind also asking if the mentioned libraries are thread safe (or not) in another sense?
For example, are they prepared for the eventuality of the same functions being called by different CPU threads simultaneously, on the same GPU stream/context?
Say cusparseCcoo2csr(), cusparseCcsr2csc(), and cusparseCcsrmm() are called in this order by two or more CPU threads. Is there any internal dependency?

I’m working on an application that needs to dynamically create a sparse matrix for each matrix-matrix multiplication.
In an iterative, single-threaded approach it looks like this:

for(unsigned int iL = 0; iL < nCSSets; iL++)
{
	unsigned int iG = iL % NCSGRIDDERS;
	std::cout << "Loop : " << iL << " Gridder: " << iG << std::endl;
	rotTrajGen.setAngle(GASENSE::GoldAngl() * iL);
		// Allocate memory on the CPU and calculate the sparse matrix (COO).
	gridder[iG].compNMatrixShifted3D(rotTrajGen, 1);
		// Move data to the GPU, allocate additional storage,
		// call cusparse*coo2csr and free the temporary storage.
		// Calculate the transpose of the matrix (cusparse*csr2csc).

	dummy.startROI.z =
	tmpCS.getFlatK().startROI.z = iL * tmpCS.getFlatK().sizeROI.z;
		// Call cusparse*csrmm.
	gridder[iG].grid(tmpCS.getFlatK(), dummy, Gridder::GRID_H);
		// Synchronize if we ran out of gridding objects.
	if(iG == NCSGRIDDERS - 1)
	{
		// ... synchronize here ...
	}
}

This results in slow execution and practically no overlap between memory transfers and execution.
After profiling, I found it’s because cudaFree() synchronizes the CPU thread.

This can be overcome by sharing memory space between different objects or by allocating enough space up front for each possible operation. Unfortunately, this isn’t a very flexible approach.
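For illustration, this is roughly what the “allocate enough space for each possible operation” variant looks like (a sketch with made-up names, not my actual code; the worst-case sizes would come from the expected matrix dimensions):

#include <cuda_runtime.h>
#include <cuComplex.h>

// One scratch area per gridding object, sized once for the worst case.
struct CsrScratch {
	int*       rowPtr;   // maxRows + 1 entries
	int*       colInd;   // maxNnz entries
	cuComplex* val;      // maxNnz entries
};

void allocScratch(CsrScratch& s, int maxRows, int maxNnz)
{
	cudaMalloc(&s.rowPtr, (maxRows + 1) * sizeof(int));
	cudaMalloc(&s.colInd, maxNnz * sizeof(int));
	cudaMalloc(&s.val,    maxNnz * sizeof(cuComplex));
}

void freeScratch(CsrScratch& s)   // called once, after the whole loop
{
	cudaFree(s.rowPtr);
	cudaFree(s.colInd);
	cudaFree(s.val);
}

// Inside the loop the buffers are simply refilled (cudaMemcpyAsync into the
// same allocations), so no cudaFree() and no implicit synchronization occurs
// per iteration.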

The other way is to create a worker thread that deallocates memory.
I found that this allowed the execution blocks to be packed more tightly on the timeline.
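What I mean is roughly this (just a sketch with made-up names, not my actual code): the compute threads hand device pointers to a queue instead of calling cudaFree() themselves, and a background thread drains the queue, so the stall caused by cudaFree() moves off the compute threads’ critical path.

#include <cuda_runtime.h>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

std::mutex              freeMx;
std::condition_variable freeCv;
std::queue<void*>       freeQueue;
bool                    stopping = false;

// Compute threads call this instead of cudaFree().
void deferredFree(void* devPtr)
{
	{ std::lock_guard<std::mutex> l(freeMx); freeQueue.push(devPtr); }
	freeCv.notify_one();
}

// Runs in its own CPU thread; the cudaFree() stall happens here.
void freeWorker()
{
	for (;;) {
		void* p;
		{
			std::unique_lock<std::mutex> l(freeMx);
			freeCv.wait(l, [] { return stopping || !freeQueue.empty(); });
			if (freeQueue.empty()) return;   // stopping and drained
			p = freeQueue.front();
			freeQueue.pop();
		}
		cudaFree(p);
	}
}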
Yet, I got a much bigger speedup and a very obvious overlap when I associate each for-iteration with a separate thread.
Unfortunately, this produces incorrect multiplication results. The memory checker shows memory access violations in the cusparse*csr2csc functions after the first iteration (when one of the threads finishes).

This makes me think the cusparse library is not thread safe, but I need to re-check my tests to be sure it’s not caused by a bug in my code.

This is just a small part of a bigger iterative algorithm.
nCSSets ranges from 60 to 500 or more.
The multiplication operands are of sizes [12 x ~3000] * [~3000 x (160*160)] of complex data, where the bigger matrix is stored in sparse format. Typically ~40000 to ~160000 non-zero values are expected.
The sparse matrix is also required in its transposed form.
The application also uses the cufft and cublas libraries, which requires creating context handles for them as well.

I’d like to be able to run at least four separate CPU threads.
As far as I remember, cusparse was crashing with an ‘internal error’ message from cusparse*coo2csr or cusparse*csr2csc if I created a separate cusparse handle per gridding object (thread).
The manual says this is most likely caused by a failure of an internal memory allocation or data transfer.

Again, I’ll be re-testing all of these options.

My development platform is CUDA 4.2 with a GTX 650M (1 GB RAM); production is a Tesla C2070.