Streaming cuSolver

Hello,

I am using cuSolverSP to run two separate batched solves. Each one operates on a batch of 3104 matrices. In order to meet my time constraints, I need the two solvers to run concurrently in separate streams.

When I use the profiler, it seems that only a very small portion of the cusolverSpCcsrqrsvBatched calls run concurrently.

static const uint zf_stream = 0;
static const uint mmse_stream = 1;
cudaStream_t stream[2];
cudaStreamCreate(&stream[zf_stream]);
cudaStreamCreate(&stream[mmse_stream]);
cusolverSpSetStream(cusolverH_zf, 		stream[zf_stream]);
cusolverSpSetStream(cusolverH_mmse, 	stream[mmse_stream]);

// Stream zf_stream
cusolverSpCcsrqrsvBatched(cusolverH_zf, m, m, nnzA, descrA_zf, dev_hhh_csr, dev_csrRowPtrhhh_zf, dev_csrColIdxhhh_zf, dev_hh_un0, dev_c_zf, batchSize, info_zf, dev_buffer_qr_zf);
// Stream mmse_stream
cusolverSpCcsrqrsvBatched(cusolverH_mmse, m, m, nnzA, descrA_mmse, dev_hhh_csr, dev_csrRowPtrhhh_mmse, dev_csrColIdxhhh_mmse, dev_hh_un0, dev_c_mmse, batchSize, info_mmse, dev_buffer_qr_mmse);

cudaStreamDestroy(stream[zf_stream]);
cudaStreamDestroy(stream[mmse_stream]);

I profiled my code to check whether it is running concurrently. If I attached the picture correctly, you can see that I have two streams separate from the default stream. The main part of the cuSolver call appears to be “csrqr_batch_core,” and it never overlaps.

I triple-checked the documentation to be sure that each parameter is input correctly, and I do get correct answers.

Or try: https://drive.google.com/file/d/0B8m-8YH4s4zhNERfVy00emoxNzg/view?usp=sharing

I also tried using cudaStreamCreateWithFlags with cudaStreamNonBlocking, but that causes my main loop to run only once and then die; I had to hard-reboot to get my machine running again.

cudaStreamCreateWithFlags(&stream[zf_stream],cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&stream[mmse_stream],cudaStreamNonBlocking);
cusolverSpSetStream(cusolverH_zf, 		stream[zf_stream]);
cusolverSpSetStream(cusolverH_mmse, 	stream[mmse_stream]);
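My best guess at the moment is that non-blocking streams no longer synchronize with the legacy default stream, so default-stream work elsewhere in my main loop can race with the solvers. Here is a sketch of what I plan to try (host_c_zf, host_c_mmse, and bytes are placeholders I've introduced here, not variables from my code above):

```cuda
// Sketch: with cudaStreamNonBlocking, work in these streams no longer
// synchronizes implicitly with the legacy default stream, so any
// default-stream copies or kernels in the main loop can race with the
// solvers. Explicitly synchronizing both streams before reusing their
// results is one possible fix.
cudaStreamCreateWithFlags(&stream[zf_stream], cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&stream[mmse_stream], cudaStreamNonBlocking);
cusolverSpSetStream(cusolverH_zf, stream[zf_stream]);
cusolverSpSetStream(cusolverH_mmse, stream[mmse_stream]);

// ... launch both batched solves as before ...

// Copy results back on the same streams the solvers ran in, then wait
// for both streams before the next loop iteration touches the data.
// host_c_zf, host_c_mmse, and bytes are placeholders.
cudaMemcpyAsync(host_c_zf, dev_c_zf, bytes, cudaMemcpyDeviceToHost, stream[zf_stream]);
cudaMemcpyAsync(host_c_mmse, dev_c_mmse, bytes, cudaMemcpyDeviceToHost, stream[mmse_stream]);
cudaStreamSynchronize(stream[zf_stream]);
cudaStreamSynchronize(stream[mmse_stream]);
```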

What makes you think a batched solver with 3104 matrices should overlap with another batched solver of 3104 matrices?

The device does not have infinite carrying capacity. When the submitted work is sufficient to occupy the device, newly issued work will wait until the previous work has completed, or has freed up enough resources for the new work to begin executing.
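As a toy illustration of that point (this sketch is independent of cuSolver; the kernel and the grid sizes are arbitrary):

```cuda
#include <cuda_runtime.h>

// A long-running dummy kernel: each thread just spins on arithmetic.
__global__ void busy(float *x, int iters)
{
    float v = x ? x[threadIdx.x] : 1.0f;
    for (int i = 0; i < iters; ++i)
        v = v * 1.000001f + 0.5f;
    if (x) x[threadIdx.x] = v;
}

int main()
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Small grids: both kernels fit on the device at once,
    // so a profiler timeline shows them overlapping.
    busy<<<2, 256, 0, s0>>>(nullptr, 1 << 20);
    busy<<<2, 256, 0, s1>>>(nullptr, 1 << 20);
    cudaDeviceSynchronize();

    // Large grids: the first launch fills every SM, so the second
    // largely serializes behind it even though it was issued to a
    // different stream.
    busy<<<4096, 256, 0, s0>>>(nullptr, 1 << 20);
    busy<<<4096, 256, 0, s1>>>(nullptr, 1 << 20);
    cudaDeviceSynchronize();
    return 0;
}
```

A batched solve over 3104 matrices is much closer to the second case than the first.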

The profiler output shows that the kernels are indeed being issued to separate streams.

There are two possibilities that remain:

  1. The issued kernels occupy the device, preventing other kernels from executing (until completion).
  2. A synchronizing call is being issued by the library, after the kernel call and before the issuance of the next kernel. If this were the case, it might be a design limitation in the library. However, a single picture of profiler output is not enough for me to explore this, and you haven’t provided working code.
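If you want to probe possibility 2 yourself, a crude host-side timing check is one option (a sketch; the argument lists are elided and the numbers are only indicative): an asynchronous launch returns to the host almost immediately, while a library call that synchronizes internally blocks the host for roughly the kernel's run time.

```cuda
#include <chrono>

// Time the host-side duration of each library call. If the library
// launches asynchronously, (t1 - t0) and (t2 - t1) are tiny and almost
// all the time lands in (t3 - t2), the final device synchronize. If the
// library synchronizes internally, the first two intervals are large.
auto t0 = std::chrono::steady_clock::now();
cusolverSpCcsrqrsvBatched(cusolverH_zf,   /* ... zf arguments ... */);
auto t1 = std::chrono::steady_clock::now();
cusolverSpCcsrqrsvBatched(cusolverH_mmse, /* ... mmse arguments ... */);
auto t2 = std::chrono::steady_clock::now();
cudaDeviceSynchronize();
auto t3 = std::chrono::steady_clock::now();
```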

Once again, thanks txbob.

Your question makes sense. I can see how the cuSolver call could fill the whole GPU (I am sure that is what all the cusolverSpXcsrqrAnalysisBatched stuff is for), and it may also have an internal synchronizing call.

Luckily I have another GPU at my disposal. I switched my second call over to the other GPU and it runs great. Everything is concurrent, and I cut off about 400 ms!!!
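For anyone else who tries this, the arrangement is roughly the following sketch (the device ids 0 and 1 and the elided argument lists are assumptions on my part):

```cuda
// One cusolverSp handle per device; each solver's buffers must be
// allocated on the device its handle was created on.
cusolverSpHandle_t cusolverH_zf, cusolverH_mmse;

cudaSetDevice(0);                       // ZF solver on GPU 0
cusolverSpCreate(&cusolverH_zf);
// ... allocate the dev_*_zf buffers on GPU 0, run the analysis step ...

cudaSetDevice(1);                       // MMSE solver on GPU 1
cusolverSpCreate(&cusolverH_mmse);
// ... allocate the dev_*_mmse buffers on GPU 1, run the analysis step ...

// Each call now runs on its own device, so the two batched solves
// proceed concurrently no matter how much of a GPU each one occupies.
cudaSetDevice(0);
cusolverSpCcsrqrsvBatched(cusolverH_zf,   /* ... ZF arguments ... */);
cudaSetDevice(1);
cusolverSpCcsrqrsvBatched(cusolverH_mmse, /* ... MMSE arguments ... */);

// Wait for both devices before using the results.
cudaSetDevice(0); cudaDeviceSynchronize();
cudaSetDevice(1); cudaDeviceSynchronize();
```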

Thanks again!!!