Implementation of a sparse triangular solver in cuSPARSE

I am writing a sparse triangular solver (Ax = b) based on the paper “Parallel Solution of Sparse Triangular Linear Systems in the Preconditioned Iterative Methods on the GPU”.
Here is the link: https://research.nvidia.com/sites/default/files/publications/nvr-2011-001.pdf
I use the level sets and chains described in the paper, store and access the matrix in CSR format, and, as the paper says, have each thread process one row.
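
To make the level-set idea concrete, here is a minimal host-side C++ sketch. It is not the paper's kernel or cuSPARSE's implementation, and the names `Csr`, `build_levels` and `solve_by_levels` are hypothetical. It assumes a lower-triangular matrix in CSR with 0-based indices, and treats a row with no stored diagonal as having a unit diagonal. The inner loop over the rows of one level is the part that would map to one GPU thread per row.

```cpp
#include <algorithm>
#include <vector>

// Lower-triangular matrix in CSR form (hypothetical struct for this sketch).
struct Csr {
    int n;                      // number of rows
    std::vector<int> row_ptr;   // size n + 1
    std::vector<int> col_idx;   // column index of each stored entry
    std::vector<double> val;    // value of each stored entry
};

// Group rows into level sets: a row's level is one more than the highest
// level among the rows it depends on (its off-diagonal columns j < i).
std::vector<std::vector<int>> build_levels(const Csr& L) {
    std::vector<int> level(L.n, 0);
    int num_levels = 0;
    for (int i = 0; i < L.n; ++i) {
        for (int k = L.row_ptr[i]; k < L.row_ptr[i + 1]; ++k) {
            int j = L.col_idx[k];
            if (j < i) level[i] = std::max(level[i], level[j] + 1);
        }
        num_levels = std::max(num_levels, level[i] + 1);
    }
    std::vector<std::vector<int>> sets(num_levels);
    for (int i = 0; i < L.n; ++i) sets[level[i]].push_back(i);
    return sets;
}

// Forward substitution, one level at a time. Rows inside a level are
// independent of each other, so on a GPU each could go to its own thread,
// with a synchronization (or a new kernel launch) between levels.
std::vector<double> solve_by_levels(const Csr& L, const std::vector<double>& b) {
    std::vector<double> x(L.n, 0.0);
    for (const auto& rows : build_levels(L)) {
        for (int i : rows) {
            double sum = b[i];
            double diag = 1.0;  // assumption: unit diagonal if none is stored
            for (int k = L.row_ptr[i]; k < L.row_ptr[i + 1]; ++k) {
                int j = L.col_idx[k];
                if (j == i) diag = L.val[k];
                else sum -= L.val[k] * x[j];
            }
            x[i] = sum / diag;
        }
    }
    return x;
}
```

For the 3x3 matrix with rows (2, 0, 0), (0, 4, 0), (1, 1, 1), rows 0 and 1 land in the first level and row 2 in the second, so the number of levels is the number of sequential steps no matter how many threads are available.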
However, my solver performs much worse than cuSPARSE's cusparseDcsrsv2_solve when the DAG has more than 10 levels. I am looking for more implementation details of the kernel function, which the paper leaves out. Also, are there any tricks for memory access in the kernel?