Why is cusparseDcsrsv_solve so slow?

I’m using a GTX 1080 Ti to implement the BiCGSTAB algorithm following Dr. Maxim Naumov’s papers, but the performance is disappointing. There are two problems:

  1. I have also run this implementation on a GTX 1050 (2 GB). A single instance is 30-40% slower there than on the GTX 1080 Ti, but the GTX 1050 can run several instances simultaneously with little performance loss. For example, one instance takes 4 s on the GTX 1050 and three concurrent instances take 5 s, whereas one instance takes 3 s on the GTX 1080 Ti and three concurrent instances take 8 s.
  2. There are many idle gaps while solving the sparse system with cusparseDcsrsv_solve (see the screenshot below), so only a small part of the elapsed time is actually spent computing. Is there a way to improve this?
    https://s1.ax1x.com/2018/03/08/92KdKS.png
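
For context on the second problem: a sparse triangular solve can only process a row once every row it depends on has been solved, so it is commonly executed level by level. If that is also what happens here (an assumption on my part, not something I have confirmed for cusparseDcsrsv_solve), a matrix with many small levels would leave the GPU mostly idle. The sketch below is a CPU-side illustration of how such levels can be counted for a lower-triangular CSR matrix. The names `compute_levels` and `count_levels` are hypothetical helpers, not part of cuSPARSE.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper, not a cuSPARSE function.
// Assumes a lower-triangular matrix in CSR format with 0-based indices.
// Row i depends on every row j < i for which L(i, j) is stored, so
// level[i] = 1 + max(level[j]) over those j, or 0 if there are none.
std::vector<int> compute_levels(const std::vector<int>& row_ptr,
                                const std::vector<int>& col_idx) {
    const int n = static_cast<int>(row_ptr.size()) - 1;
    std::vector<int> level(n, 0);
    for (int i = 0; i < n; ++i) {
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            const int j = col_idx[k];
            if (j < i) {  // skip the diagonal entry
                level[i] = std::max(level[i], level[j] + 1);
            }
        }
    }
    return level;
}

// Number of distinct levels, i.e. the number of sequential steps a
// level-by-level solve would need for this sparsity pattern.
int count_levels(const std::vector<int>& level) {
    return level.empty()
               ? 0
               : *std::max_element(level.begin(), level.end()) + 1;
}
```

Rows in the same level are independent and can be solved in parallel, while the levels themselves must run one after another. A matrix whose level count is close to its row count therefore offers very little parallelism per step.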