Hi njuffa, thank you for your answer.
Would you mind also asking whether the mentioned libraries are thread safe (or not)?
For example, are they prepared for the eventuality of the same functions being called simultaneously by different CPU threads, on the same GPU stream/context?
Say cusparseCcoo2csr(), cusparseCcsr2csc() and cusparseCcsrmm() are called in this order by two or more CPU threads. Is there any internal dependency?
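For reference, this is roughly the calling pattern I have in mind (a minimal sketch only, assuming a single shared cusparse handle, C++11 std::thread just to illustrate the threading, and cusparseXcoo2csr standing in for the whole conversion/multiplication sequence; the arrays and function names here are made up for illustration):

#include <cuda_runtime.h>
#include <cusparse_v2.h>
#include <thread>
#include <vector>

// Each CPU thread converts its own COO row-index array to a CSR row-pointer
// array, but both use the SAME cusparse handle (and hence the same context).
static void convertOnSharedHandle(cusparseHandle_t handle,
                                  std::vector<int> cooRows, int m)
{
    int nnz = static_cast<int>(cooRows.size());
    int *d_cooRows = 0;
    int *d_csrRowPtr = 0;
    cudaMalloc((void**)&d_cooRows, nnz * sizeof(int));
    cudaMalloc((void**)&d_csrRowPtr, (m + 1) * sizeof(int));
    cudaMemcpy(d_cooRows, cooRows.data(), nnz * sizeof(int),
               cudaMemcpyHostToDevice);

    // The call in question: is it safe when two CPU threads issue it
    // (and the csr2csc / csrmm calls that follow) concurrently?
    cusparseXcoo2csr(handle, d_cooRows, nnz, m, d_csrRowPtr,
                     CUSPARSE_INDEX_BASE_ZERO);

    cudaDeviceSynchronize();
    cudaFree(d_cooRows);
    cudaFree(d_csrRowPtr);
}

int main()
{
    cusparseHandle_t handle;
    cusparseCreate(&handle);

    // Two tiny COO matrices (row indices only, sorted by row), 4 rows each.
    std::vector<int> rowsA = {0, 0, 1, 2, 3};
    std::vector<int> rowsB = {0, 1, 1, 3};

    // Both CPU threads work on the same handle/context at the same time.
    std::thread t1(convertOnSharedHandle, handle, rowsA, 4);
    std::thread t2(convertOnSharedHandle, handle, rowsB, 4);
    t1.join();
    t2.join();

    cusparseDestroy(handle);
    return 0;
}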
I'm working on an application that needs to dynamically create a sparse matrix for each matrix-matrix multiplication.
In the iterative, single-threaded approach it looks like this:
for(unsigned int iL = 0; iL < nCSSets; iL++)
{
    unsigned int iG = iL % NCSGRIDDERS;
    std::cout << "Loop : " << iL << " Gridder: " << iG << std::endl;
    rotTrajGen.setAngle(GASENSE::GoldAngl() * iL);
    // Allocate memory on the CPU and calculate the sparse matrix (COO).
    gridder[iG].compNMatrixShifted3D(rotTrajGen, 1);
    // Move data to the GPU, allocate additional storage,
    // call cusparse*coo2csr, then free the temporary storage.
    gridder[iG].compres();
    // Calculate the transpose of the matrix.
    // cusparse*csr2csc
    gridder[iG].N2H();
    dummy.startROI.z =
        tmpCS.getFlatK().startROI.z = iL * tmpCS.getFlatK().sizeROI.z;
    // Call cusparse*csrmm.
    gridder[iG].grid(tmpCS.getFlatK(), dummy, Gridder::GRID_H);
    FFTControl::setNexStream();
    // Synchronize if we ran out of gridding objects.
    if(iG == NCSGRIDDERS - 1)
    {
        FFTControl::syncAllStreams();
        FFTControl::setStream(0);
    }
}
This results in slow execution and practically no overlap between memory transfers and kernel execution.
After profiling I found this is because cudaFree() synchronizes the calling CPU thread.
This can be overcome by sharing memory between the different objects, or by allocating enough space up front for every possible operation; unfortunately, neither is a very flexible approach.
The other way is to create a worker thread that performs the deallocations.
I found that this packs the execution blocks more tightly on the timeline.
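For clarity, this is roughly what I mean by the deallocation worker thread (a minimal sketch assuming C++11 threads; the class name and interface are made up for illustration, my real code uses its own wrappers):

#include <cuda_runtime.h>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Queue of device pointers to be freed by a dedicated worker thread,
// so the main loop is never blocked waiting on cudaFree().
class DeferredFree
{
public:
    DeferredFree() : stop_(false), worker_(&DeferredFree::run, this) {}
    ~DeferredFree()
    {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_one();
        worker_.join();   // drains any remaining pointers before exiting
    }
    // Called from the main thread instead of cudaFree(); returns immediately.
    void release(void *devPtr)
    {
        { std::lock_guard<std::mutex> lk(m_); q_.push(devPtr); }
        cv_.notify_one();
    }
private:
    void run()
    {
        for (;;)
        {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return stop_ || !q_.empty(); });
            if (q_.empty() && stop_) return;
            void *p = q_.front();
            q_.pop();
            lk.unlock();
            cudaFree(p);   // the blocking happens on this worker thread,
                           // not on the thread that queues the GPU work
        }
    }
    bool stop_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<void*> q_;
    std::thread worker_;
};

int main()
{
    DeferredFree deferred;
    void *p = 0;
    cudaMalloc(&p, 1 << 20);
    // ... use the buffer ...
    deferred.release(p);   // instead of cudaFree(p)
    return 0;
}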
Yet I got a much larger speed-up, and a very obvious overlap, when I associated each for-iteration with a separate CPU thread.
Unfortunately, this produces incorrect multiplication results. The memory checker shows memory access violations in the cusparse*csr2csc functions after the first iteration (i.e., once one of the threads finishes).
This makes me think the cusparse library is not thread safe, but I need to re-check my tests to be sure it is not caused by a bug in my code.
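In rough outline, the threaded variant looks like this (a simplified sketch; processIteration here is just a stand-in for the body of the single-threaded loop above, and the batch size corresponds to the number of gridding objects):

#include <thread>
#include <vector>

// Stand-in for the loop body shown above: build the COO matrix on the CPU,
// convert it to CSR on the GPU, transpose it and run the multiplication.
static void processIteration(unsigned int iL)
{
    (void)iL;
}

static void runThreaded(unsigned int nCSSets, unsigned int nGridders)
{
    for (unsigned int iL = 0; iL < nCSSets; )
    {
        // Launch one CPU thread per gridding object, then wait for the
        // whole batch before reusing the gridders (this mirrors the
        // syncAllStreams() call in the single-threaded loop).
        std::vector<std::thread> batch;
        for (unsigned int t = 0; t < nGridders && iL < nCSSets; ++t, ++iL)
            batch.push_back(std::thread(processIteration, iL));
        for (std::thread &th : batch)
            th.join();
    }
}

int main()
{
    runThreaded(60, 4);
    return 0;
}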
This is just a small part of a bigger iterative algorithm.
nCSSets ranges from 60 to 500 or more.
The multiplication operands are of sizes [12 x ~3000] * [~3000 x (160*160)] with complex data, where the bigger matrix is stored in sparse format; typically ~40000 to ~160000 non-zero values are expected.
The sparse matrix is also required in its transposed form.
The application also uses the cufft and cublas libraries, which requires creating context handles for them as well.
I’d like to be able to run at least four separate CPU threads.
As far as I remember, cusparse was crashing with an 'internal error' message from cusparse*coo2csr or cusparse*csr2csc if I created a separate cusparse handle for each gridding object (thread).
The manual says this is most likely caused by a failure in an internal memory allocation or data transfer.
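For completeness, this is the kind of status checking I wrap around each call (a small sketch; only the standard cusparseStatus_t return codes are assumed, CUSPARSE_STATUS_INTERNAL_ERROR being the one I was seeing):

#include <cusparse_v2.h>
#include <cstdio>
#include <cstdlib>

// Abort with a readable message if a cusparse call does not return
// CUSPARSE_STATUS_SUCCESS (e.g. on CUSPARSE_STATUS_INTERNAL_ERROR).
#define CHECK_CUSPARSE(call)                                              \
    do {                                                                  \
        cusparseStatus_t st_ = (call);                                    \
        if (st_ != CUSPARSE_STATUS_SUCCESS) {                             \
            std::fprintf(stderr, "%s failed with status %d (%s:%d)\n",    \
                         #call, (int)st_, __FILE__, __LINE__);            \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

int main()
{
    cusparseHandle_t handle;
    CHECK_CUSPARSE(cusparseCreate(&handle));
    // ... conversion and multiplication calls wrapped the same way ...
    CHECK_CUSPARSE(cusparseDestroy(handle));
    return 0;
}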
Again, I'll be re-testing all of these options.
My development platform is CUDA 4.2 with a GTX 650M (1 GB RAM); the production platform is a Tesla C2070.
Thanks,
Greg.