Streaming cuSolver

Hello,

I am using cuSolverSP to run two separate batched solves. Each one operates on a batch of 3104 matrices. In order to meet my time constraints, I need the two solvers to run concurrently in separate streams.

When I use the profiler, it seems that only a very small portion of the cusolverSpCcsrqrsvBatched calls run concurrently.

static const uint zf_stream = 0;
static const uint mmse_stream = 1;
cudaStream_t stream[2];
cudaStreamCreate(&stream[zf_stream]);
cudaStreamCreate(&stream[mmse_stream]);
cusolverSpSetStream(cusolverH_zf, 		stream[zf_stream]);
cusolverSpSetStream(cusolverH_mmse, 	stream[mmse_stream]);

// Stream zf_stream
cusolverSpCcsrqrsvBatched(cusolverH_zf, m, m, nnzA, descrA_zf, dev_hhh_csr, dev_csrRowPtrhhh_zf, dev_csrColIdxhhh_zf, dev_hh_un0, dev_c_zf, batchSize, info_zf, dev_buffer_qr_zf);
// Stream mmse_stream
cusolverSpCcsrqrsvBatched(cusolverH_mmse, m, m, nnzA, descrA_mmse, dev_hhh_csr, dev_csrRowPtrhhh_mmse, dev_csrColIdxhhh_mmse, dev_hh_un0, dev_c_mmse, batchSize, info_mmse, dev_buffer_qr_mmse);

cudaStreamDestroy(stream[zf_stream]);
cudaStreamDestroy(stream[mmse_stream]);

I profiled my code to check whether it is running concurrently. If I attached the picture correctly, you can see that I have two streams separate from the default stream. The main part of the cuSolver call appears to be “csrqr_batch_core,” and it never overlaps.

I triple-checked the documentation to be sure that each parameter is input correctly, and I do get correct answers.

Or try: https://drive.google.com/file/d/0B8m-8YH4s4zhNERfVy00emoxNzg/view?usp=sharing

I also tried using cudaStreamCreateWithFlags with cudaStreamNonBlocking, but that causes my main loop to run only once and then die; I had to hard-reboot to get my machine running again.

cudaStreamCreateWithFlags(&stream[zf_stream],cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&stream[mmse_stream],cudaStreamNonBlocking);
cusolverSpSetStream(cusolverH_zf, 		stream[zf_stream]);
cusolverSpSetStream(cusolverH_mmse, 	stream[mmse_stream]);
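My best guess at the moment is that non-blocking streams no longer synchronize with the legacy default stream, so default-stream work elsewhere in my main loop can race with the solvers. Here is a sketch of what I plan to try (host_c_zf, host_c_mmse, and bytes are placeholders I've introduced here, not variables from my code above):

```cuda
// Sketch: with cudaStreamNonBlocking, work in these streams no longer
// synchronizes implicitly with the legacy default stream, so any
// default-stream copies or kernels in the main loop can race with the
// solvers. Explicitly synchronizing both streams before reusing their
// results is one possible fix.
cudaStreamCreateWithFlags(&stream[zf_stream], cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&stream[mmse_stream], cudaStreamNonBlocking);
cusolverSpSetStream(cusolverH_zf, stream[zf_stream]);
cusolverSpSetStream(cusolverH_mmse, stream[mmse_stream]);

// ... launch both batched solves as before ...

// Copy results back on the same streams the solvers ran in, then wait
// for both streams before the next loop iteration touches the data.
// host_c_zf, host_c_mmse, and bytes are placeholders.
cudaMemcpyAsync(host_c_zf, dev_c_zf, bytes, cudaMemcpyDeviceToHost, stream[zf_stream]);
cudaMemcpyAsync(host_c_mmse, dev_c_mmse, bytes, cudaMemcpyDeviceToHost, stream[mmse_stream]);
cudaStreamSynchronize(stream[zf_stream]);
cudaStreamSynchronize(stream[mmse_stream]);
```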

What makes you think a batched solver with 3104 matrices should overlap with another batched solver of 3104 matrices?

The device does not have infinite carrying capacity. When the submitted work is sufficient to occupy the device, newly issued work will wait until the previous work has completed, or has freed up enough resources for the new work to begin executing.
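As a toy illustration of that point (this sketch is independent of cuSolver; the kernel and the grid sizes are arbitrary):

```cuda
#include <cuda_runtime.h>

// A long-running dummy kernel: each thread just spins on arithmetic.
__global__ void busy(float *x, int iters)
{
    float v = x ? x[threadIdx.x] : 1.0f;
    for (int i = 0; i < iters; ++i)
        v = v * 1.000001f + 0.5f;
    if (x) x[threadIdx.x] = v;
}

int main()
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Small grids: both kernels fit on the device at once,
    // so a profiler timeline shows them overlapping.
    busy<<<2, 256, 0, s0>>>(nullptr, 1 << 20);
    busy<<<2, 256, 0, s1>>>(nullptr, 1 << 20);
    cudaDeviceSynchronize();

    // Large grids: the first launch fills every SM, so the second
    // largely serializes behind it even though it was issued to a
    // different stream.
    busy<<<4096, 256, 0, s0>>>(nullptr, 1 << 20);
    busy<<<4096, 256, 0, s1>>>(nullptr, 1 << 20);
    cudaDeviceSynchronize();
    return 0;
}
```

A batched solve over 3104 matrices is much closer to the second case than the first.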

The profiler output shows that the kernels are indeed being issued to separate streams.

There are two possibilities that remain:

  1. The issued kernels occupy the device, preventing other kernels from executing (until completion).
  2. A synchronizing call is being issued by the library, after the kernel call and before the issuance of the next kernel. If this were the case, it might be a design limitation in the library. However, a single picture of profiler output is not enough for me to explore this, and you haven’t provided working code.
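If you want to probe possibility 2 yourself, a crude host-side timing check is one option (a sketch; the argument lists are elided and the numbers are only indicative): an asynchronous launch returns to the host almost immediately, while a library call that synchronizes internally blocks the host for roughly the kernel's run time.

```cuda
#include <chrono>

// Time the host-side duration of each library call. If the library
// launches asynchronously, (t1 - t0) and (t2 - t1) are tiny and almost
// all the time lands in (t3 - t2), the final device synchronize. If the
// library synchronizes internally, the first two intervals are large.
auto t0 = std::chrono::steady_clock::now();
cusolverSpCcsrqrsvBatched(cusolverH_zf,   /* ... zf arguments ... */);
auto t1 = std::chrono::steady_clock::now();
cusolverSpCcsrqrsvBatched(cusolverH_mmse, /* ... mmse arguments ... */);
auto t2 = std::chrono::steady_clock::now();
cudaDeviceSynchronize();
auto t3 = std::chrono::steady_clock::now();
```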

Once again, thanks txbob.

Your question makes sense. I can see how the cuSolver call could fill the whole GPU (I am sure that is what all the cusolverSpXcsrqrAnalysisBatched stuff is for), and it may also have an internal synchronizing call.

Luckily I have another GPU at my disposal. I switched my second call over to the other GPU and it runs great. Everything is concurrent, and I cut off about 400 ms!!!
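For anyone else who tries this, the arrangement is roughly the following sketch (the device ids 0 and 1 and the elided argument lists are assumptions on my part):

```cuda
// One cusolverSp handle per device; each solver's buffers must be
// allocated on the device its handle was created on.
cusolverSpHandle_t cusolverH_zf, cusolverH_mmse;

cudaSetDevice(0);                       // ZF solver on GPU 0
cusolverSpCreate(&cusolverH_zf);
// ... allocate the dev_*_zf buffers on GPU 0, run the analysis step ...

cudaSetDevice(1);                       // MMSE solver on GPU 1
cusolverSpCreate(&cusolverH_mmse);
// ... allocate the dev_*_mmse buffers on GPU 1, run the analysis step ...

// Each call now runs on its own device, so the two batched solves
// proceed concurrently no matter how much of a GPU each one occupies.
cudaSetDevice(0);
cusolverSpCcsrqrsvBatched(cusolverH_zf,   /* ... ZF arguments ... */);
cudaSetDevice(1);
cusolverSpCcsrqrsvBatched(cusolverH_mmse, /* ... MMSE arguments ... */);

// Wait for both devices before using the results.
cudaSetDevice(0); cudaDeviceSynchronize();
cudaSetDevice(1); cudaDeviceSynchronize();
```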

Thanks again!!!