Hello,
I am using cuSolverSP to solve two separate batched solvers. Each one has 3104 sets of matrices. In order to meet time constraints I need to stream these solvers to run concurrently.
When I use the profiler it seems that it only a very small portion of the cusolverSpCcsrqrsvBatched is running concurrently.
static const uint zf_stream = 0;
static const uint mmse_stream = 1;
cudaStream_t stream[2];
cudaStreamCreate(&stream[zf_stream]);
cudaStreamCreate(&stream[mmse_stream]);
cusolverSpSetStream(cusolverH_zf, stream[zf_stream]);
cusolverSpSetStream(cusolverH_mmse, stream[mmse_stream]);
// Stream zf_stream
cusolverSpCcsrqrsvBatched(cusolverH_zf, m, m, nnzA, descrA_zf, dev_hhh_csr, dev_csrRowPtrhhh_zf, ev_csrColIdxhhh_zf, dev_hh_un0, dev_c_zf, batchSize, info_zf, dev_buffer_qr_zf);
// Stream mmse_stream
cusolverSpCcsrqrsvBatched(cusolverH_mmse, m, m, nnzA, descrA_mmse, dev_hhh_csr, dev_csrRowPtrhhh_mmse, ev_csrColIdxhhh_mmse, dev_hh_un0, dev_c_mmse, batchSize, info_mmse, dev_buffer_qr_mmse);
// Stream mmse_stream
cusolverSpCcsrqrsvBatched(cusolverH_mmse, m, m, nnzA, descrA_mmse, dev_hhh_csr, dev_csrRowPtrhhh_mmse, ev_csrColIdxhhh_mmse, dev_hh_un0, dev_c_mmse, batchSize, info_mmse, dev_buffer_qr_mmse);
cudaStreamDestroy(stream[zf_stream]);
cudaStreamDestroy(stream[mmse_stream]);
I profiled my code to test if it is running concurrently. If I attached the picture correctly you can see that I have 2 streams separate from the default stream. It seems like that main part of the cuSolver call is “csrqr_batch_core.” It never overlaps.
I triple checked the documentation to be sure that inputing each parameter is correct. I get correct answers.
or Try: https://drive.google.com/file/d/0B8m-8YH4s4zhNERfVy00emoxNzg/view?usp=sharing
I also tried using cudaStreamCreateWithFlags with cudaStreamNonBlocking but that causes my main loop to only run one time then die after that. I had to hard reboot to get my machine running again.
cudaStreamCreateWithFlags(&stream[zf_stream],cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&stream[mmse_stream],cudaStreamNonBlocking);
cusolverSpSetStream(cusolverH_zf, stream[zf_stream]);
cusolverSpSetStream(cusolverH_mmse, stream[mmse_stream]);