multi-threading with cusparse lib

Hi,
I’m working on an application that creates a couple of CPU threads to do some work with a shared cusparse handle. I’d like every thread to use its own assigned cudaStream.
I know there is cusparseSetStream(), which sets a stream on a handle.
My worry is that there can be a race condition between the CPU threads and everything gets messed up.
For now, I have a mutex object associated with the shared cusparse handle (roughly as in the sketch below).
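To make it concrete, my current locking pattern looks roughly like this. It is only a minimal sketch, not my actual code; workerBody is a made-up name, the cusparse calls are left as a comment, and I use C++11 threading purely for illustration:

#include <mutex>
#include <cuda_runtime.h>
#include <cusparse.h>

cusparseHandle_t sharedHandle;   // created once at start-up with cusparseCreate()
std::mutex       handleMutex;    // guards every use of sharedHandle

// Body executed by each CPU thread; every thread owns one cudaStream.
void workerBody(cudaStream_t myStream)
{
    std::lock_guard<std::mutex> lock(handleMutex);
    cusparseSetStream(sharedHandle, myStream);   // bind my stream to the shared handle
    // ... issue this thread's cusparse calls here (coo2csr, csr2csc, csrmm, ...) ...
    // The lock is held until the calls have been issued, so no other thread
    // can re-bind the handle to a different stream in between.
}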

My question is: does cusparseSetStream() work in a similar way to cudaSetDevice() with respect to internally associating threads with streams?
Either way, is the behaviour the same in cublas and cufft?

Oh, and yes, I do know I could have a handle per thread, but that’s not an answer to my problem.
In my experience, creating multiple handles consumes quite a lot of memory.

Best wishes,
Greg.

I checked with the team in charge of CUDA’s linear algebra libraries. The CUSPARSE and CUBLAS libraries are thread safe up to a point, but not in the sense you described above:

  1. The CUSPARSE and CUBLAS libraries are thread safe from the point of view of textures. They use locks to make “texture bind/kernel/texture unbind” into an atomic section.

  2. The CUSPARSE and CUBLAS libraries are not thread safe from the point of view of streams. If the user is using multiple streams and does not want to create multiple handles (one for each thread), the solution is to use their own lock.

Can you explain why using multiple contexts is unacceptable in your case? As you note, each context requires a fair amount of memory, but unless many contexts are created, that should be tolerable. One design alternative would be to use a “worker thread” approach where only a single thread creates a GPU context and all other threads communicate with that worker thread; see the sketch below.
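In outline, the worker-thread idea is just a standard producer/consumer queue. The following is only a sketch of the concept; gpuWorker, submit and the std::function-based queue are illustrative names, not a prescribed implementation:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

std::queue<std::function<void()> > taskQueue;   // work items for the GPU thread
std::mutex                         queueMutex;
std::condition_variable            queueCv;
bool                               shuttingDown = false;

// The only thread that ever creates the CUDA context and the
// CUSPARSE/CUBLAS/CUFFT handles, and the only one that calls into them.
void gpuWorker()
{
    for (;;)
    {
        std::function<void()> task;
        {
            std::unique_lock<std::mutex> lock(queueMutex);
            queueCv.wait(lock, [] { return shuttingDown || !taskQueue.empty(); });
            if (taskQueue.empty())
                return;                      // only exit once the queue has drained
            task = taskQueue.front();
            taskQueue.pop();
        }
        task();                              // runs the library calls on this thread's context
    }
    // At shutdown, another thread sets shuttingDown = true under the mutex
    // and calls queueCv.notify_all().
}

// Called by the other CPU threads to hand work to the GPU thread.
void submit(std::function<void()> task)
{
    {
        std::lock_guard<std::mutex> lock(queueMutex);
        taskQueue.push(task);
    }
    queueCv.notify_one();
}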

Hi njuffa, thank you for your answer.
Would you mind also asking whether the mentioned libraries are thread safe (or not) in a broader sense?
For example, are they prepared for the eventuality of the same functions being called by different CPU threads simultaneously, on the same GPU stream/context?
Say cusparseCcoo2csr(), cusparseCcsr2csc() and cusparseCcsrmm() are called in this order by two or more CPU threads. Is there any internal dependency?

I’m working on an application that needs to dynamically create a sparse matrix for each matrix-matrix multiplication.
In the iterative, single-threaded approach it looks like this:

for (unsigned int iL = 0; iL < nCSSets; iL++)
{
    unsigned int iG = iL % NCSGRIDDERS;
    std::cout << "Loop : " << iL << " Gridder: " << iG << std::endl;
    rotTrajGen.setAngle(GASENSE::GoldAngl() * iL);

    // Allocate memory on the CPU and calculate the sparse matrix (COO).
    gridder[iG].compNMatrixShifted3D(rotTrajGen, 1);

    // Move data to the GPU, allocate additional storage,
    // call cusparse*coo2csr and free the temporary storage.
    gridder[iG].compres();

    // Calculate the transpose of the matrix (cusparse*csr2csc).
    gridder[iG].N2H();

    dummy.startROI.z =
        tmpCS.getFlatK().startROI.z = iL * tmpCS.getFlatK().sizeROI.z;

    // Call cusparse*csrmm.
    gridder[iG].grid(tmpCS.getFlatK(), dummy, Gridder::GRID_H);
    FFTControl::setNexStream();

    // Synchronize if we have run out of gridding objects.
    if (iG == NCSGRIDDERS - 1)
    {
        FFTControl::syncAllStreams();
        FFTControl::setStream(0);
    }
}

This results in slow execution and practically no overlap between memory transfers and kernel execution.
After profiling I found that this is because cudaFree() synchronizes: it blocks the calling CPU thread until the device has finished its outstanding work.

This can be overcome by sharing memory space between different objects, or by allocating enough space up front for each possible operation and reusing it, so that no cudaFree() is needed inside the loop. Unfortunately, this isn’t a very flexible approach.
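For the “allocate enough space up front” variant, what I have in mind is roughly the following. It is only a sketch; runAllIterations and maxNnz are hypothetical, and maxNnz would have to cover the worst case over all iterations:

#include <cuComplex.h>
#include <cuda_runtime.h>

void runAllIterations(unsigned int nCSSets, size_t maxNnz)
{
    // Worst-case sized buffers, allocated once before the loop and reused.
    cuComplex *d_cooVal = 0;
    int       *d_cooRow = 0;
    int       *d_cooCol = 0;
    cudaMalloc((void**)&d_cooVal, maxNnz * sizeof(cuComplex));
    cudaMalloc((void**)&d_cooRow, maxNnz * sizeof(int));
    cudaMalloc((void**)&d_cooCol, maxNnz * sizeof(int));

    for (unsigned int iL = 0; iL < nCSSets; iL++)
    {
        // ... upload this iteration's matrix into the same buffers (nnz <= maxNnz) ...
        // ... cusparse*coo2csr / csr2csc / csrmm working on the reused buffers ...
    }

    // The only cudaFree() calls happen once, after the loop.
    cudaFree(d_cooVal);
    cudaFree(d_cooRow);
    cudaFree(d_cooCol);
}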

The other way is to create a worker thread whose only job is to deallocate memory.
I found that this allowed the execution blocks to be packed more tightly on the timeline; a rough sketch of what I mean follows.
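Something along these lines (again just an illustration, not my actual code; freeWorker, pushForFree and shutdownFreeWorker are made-up names, and I use C++11 threading only to keep the example short):

#include <condition_variable>
#include <deque>
#include <mutex>
#include <cuda_runtime.h>

// Pointers to be freed are handed to a dedicated thread, so the
// cudaFree() synchronization does not stall the threads doing real work.
std::deque<void*>       freeList;
std::mutex              freeMutex;
std::condition_variable freeCv;
bool                    freeDone = false;

void freeWorker()
{
    for (;;)
    {
        void *p = 0;
        {
            std::unique_lock<std::mutex> lock(freeMutex);
            freeCv.wait(lock, [] { return freeDone || !freeList.empty(); });
            if (freeList.empty())
                return;              // only exit once everything has been freed
            p = freeList.front();
            freeList.pop_front();
        }
        cudaFree(p);                 // blocking, but only for this thread
    }
}

void pushForFree(void *p)
{
    {
        std::lock_guard<std::mutex> lock(freeMutex);
        freeList.push_back(p);
    }
    freeCv.notify_one();
}

void shutdownFreeWorker()
{
    {
        std::lock_guard<std::mutex> lock(freeMutex);
        freeDone = true;
    }
    freeCv.notify_one();
}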
Still, I got a much larger speedup and a very obvious overlap when I associated each for-iteration with a separate thread.
Unfortunately, this produces incorrect multiplication results. The memory checker shows memory access violations in the cusparse*csr2csc functions after the first iteration, i.e. once one of the threads finishes.

This makes me think the cusparse library is not thread safe, but I need to re-check my tests to be sure it isn’t caused by a bug in my own code.

This is just a small part of a bigger iterative algorithm.
nCSSets is in the range of 60 to 500 or more.
The multiplication operands are of sizes [12 x ~3000] * [~3000 x (160*160)] of complex data, where the bigger matrix is stored in sparse format. Typically ~40000 to ~160000 non-zero values are expected.
The sparse matrix is also required in its transposed form.
The application also uses the cufft and cublas libraries, which requires creating handles for them as well.

I’d like to be able to run at least four separate CPU threads.
As far as I remember, cusparse was crashing with an ‘internal error’ message from cusparse*coo2csr or cusparse*csr2csc if I created a separate cusparse handle per gridding object (thread).
The manual says this is most likely caused by a failure in an internal memory allocation or data transfer.

Again, I’ll be re-testing all of these options.

My development platform is CUDA 4.2 with a GTX 650M (1 GB RAM); the production platform is a Tesla C2070.

Thanks,
Greg.