cuSolver stream parallelism


I am using cuSolver in a project to compute an LU decomposition and reuse it many times.
I have several decompositions and would like to process them in parallel. To do that, I am using a different stream for each system.

For this simple test, I created four 4x4 matrices, and each solve is applied to 3 vectors (x, y, z).

The bulk of the code is omitted, but the main loop looks like this:


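for (auto j = 0; j < 100; j++) {
    for (auto i = 0; i < 4; i++) {
        // first batch: custom kernel for system i (launch omitted)
    }

    for (auto i = 0; i < 4; i++) {
        // solve system i on its own stream, reusing the LU factors stored in A[i]
        cusolverDnSetStream(hdl, stream[i]);
        cusolverDnSgetrs(hdl, CUBLAS_OP_N, m, 3, A[i], m, NULL, b[i], m, NULL);
    }

    for (auto i = 0; i < 4; i++) {
        // last batch: custom kernel for system i (launch omitted)
    }
}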
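For anyone who wants to reproduce this, here is a minimal self-contained sketch of the kind of setup described above. It is a reconstruction under assumptions, not the actual project code: the function name run_demo, the buffer names (rhs, work, ipiv, info) and the matrix data are all placeholders. Unlike the call above, it passes pivot and info arrays to getrf/getrs instead of NULL, and a device-to-device copy stands in for the custom kernels that prepare each right-hand side.

```cuda
#include <cuda_runtime.h>
#include <cusolverDn.h>
#include <vector>

// Early-return error checks; this is a sketch, so nothing is released on failure.
#define CHECK_CUDA(x)   do { if ((x) != cudaSuccess) return 1; } while (0)
#define CHECK_SOLVER(x) do { if ((x) != CUSOLVER_STATUS_SUCCESS) return 2; } while (0)

// Hypothetical reproduction: 4 independent 4x4 systems, 3 right-hand sides each.
// Returns 0 if every call succeeds and the final getrs of each system reports info == 0.
int run_demo() {
    const int nsys = 4, m = 4, nrhs = 3, iters = 100;

    // Placeholder data: column-major, diagonally dominant A and all-ones b.
    std::vector<float> hA(m * m, 1.0f), hb(m * nrhs, 1.0f);
    for (int k = 0; k < m; k++) hA[k * m + k] = 10.0f;

    cusolverDnHandle_t hdl;
    CHECK_SOLVER(cusolverDnCreate(&hdl));

    cudaStream_t stream[nsys];
    float *A[nsys], *rhs[nsys], *b[nsys], *work[nsys];
    int *ipiv[nsys], *info[nsys];

    // Setup: one stream per system, then factor each matrix once with getrf.
    for (int i = 0; i < nsys; i++) {
        CHECK_CUDA(cudaStreamCreate(&stream[i]));
        CHECK_CUDA(cudaMalloc(&A[i], sizeof(float) * m * m));
        CHECK_CUDA(cudaMalloc(&rhs[i], sizeof(float) * m * nrhs));
        CHECK_CUDA(cudaMalloc(&b[i], sizeof(float) * m * nrhs));
        CHECK_CUDA(cudaMalloc(&ipiv[i], sizeof(int) * m));
        CHECK_CUDA(cudaMalloc(&info[i], sizeof(int)));
        CHECK_CUDA(cudaMemcpy(A[i], hA.data(), sizeof(float) * m * m,
                              cudaMemcpyHostToDevice));
        CHECK_CUDA(cudaMemcpy(rhs[i], hb.data(), sizeof(float) * m * nrhs,
                              cudaMemcpyHostToDevice));

        int lwork = 0;
        CHECK_SOLVER(cusolverDnSgetrf_bufferSize(hdl, m, m, A[i], m, &lwork));
        CHECK_CUDA(cudaMalloc(&work[i], sizeof(float) * lwork));
        CHECK_SOLVER(cusolverDnSetStream(hdl, stream[i]));
        CHECK_SOLVER(cusolverDnSgetrf(hdl, m, m, A[i], m, work[i], ipiv[i], info[i]));
    }

    // Main loop: same shape as in the post, with a single shared handle.
    for (int j = 0; j < iters; j++) {
        for (int i = 0; i < nsys; i++) {
            // Stand-in for the custom kernels that prepare b[i] on stream[i].
            CHECK_CUDA(cudaMemcpyAsync(b[i], rhs[i], sizeof(float) * m * nrhs,
                                       cudaMemcpyDeviceToDevice, stream[i]));
        }
        for (int i = 0; i < nsys; i++) {
            CHECK_SOLVER(cusolverDnSetStream(hdl, stream[i]));
            CHECK_SOLVER(cusolverDnSgetrs(hdl, CUBLAS_OP_N, m, nrhs, A[i], m,
                                          ipiv[i], b[i], m, info[i]));
        }
    }
    CHECK_CUDA(cudaDeviceSynchronize());

    // Read back the status of the last solve per system, then clean up.
    int bad = 0;
    for (int i = 0; i < nsys; i++) {
        int h = -1;
        CHECK_CUDA(cudaMemcpy(&h, info[i], sizeof(int), cudaMemcpyDeviceToHost));
        if (h != 0) bad = 3;
        cudaFree(A[i]); cudaFree(rhs[i]); cudaFree(b[i]);
        cudaFree(work[i]); cudaFree(ipiv[i]); cudaFree(info[i]);
        cudaStreamDestroy(stream[i]);
    }
    cusolverDnDestroy(hdl);
    return bad;
}
```

Calling run_demo() from main and running the program under a profiler should show whether the getrs calls on the four streams overlap. The sketch only reproduces the call pattern; it does not by itself explain or remove any synchronization.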

This is the kind of behavior I am getting:

The first and last batches are regular CUDA kernels implemented in my own code. They run concurrently, as expected.

Between getrs calls, cuSolver seems to be creating an event and checking whether the computation has finished. Since the calls use different streams, I would expect the computations to be independent.

Could someone help me figure out why this is happening?
Thank you.