About performance of cuSPARSE QR solver on V100 GPU

Hi,

We are using the sparse QR direct solver cusolverSpDcsrlsvqr for two cases on a V100 GPU, but the performance on the V100 is very similar to that on a Quadro T2000 mobile GPU.

  • m=n=5890, nnz=37040: 80 ms on V100 vs. 85 ms on T2000

  • m=n=17036, nnz=129412: 215 ms on V100 vs. 220 ms on T2000
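
To put the two cases in perspective, here is a small back-of-the-envelope sketch in Python. It uses only the sizes and timings quoted above and assumes the usual CSR layout with 8-byte double values and 4-byte int indices, which is what the `D` variant of the solver takes. It only sizes the input matrix; the QR factors will be larger because of fill-in, which is not estimated here.

```python
# Problem sizes and timings quoted in the post above.
cases = [
    {"n": 5890, "nnz": 37040, "v100_ms": 80.0, "t2000_ms": 85.0},
    {"n": 17036, "nnz": 129412, "v100_ms": 215.0, "t2000_ms": 220.0},
]


def csr_bytes(n, nnz, value_bytes=8, index_bytes=4):
    """Bytes for a CSR matrix: values + column indices + (n + 1) row pointers."""
    return nnz * value_bytes + nnz * index_bytes + (n + 1) * index_bytes


def summarize(case):
    """Derive density, input footprint and V100-over-T2000 speedup for one case."""
    return {
        "nnz_per_row": case["nnz"] / case["n"],
        "csr_kib": csr_bytes(case["n"], case["nnz"]) / 1024,
        "speedup": case["t2000_ms"] / case["v100_ms"],
    }


summaries = [summarize(c) for c in cases]
for c, s in zip(cases, summaries):
    print(
        f"n={c['n']}: {s['nnz_per_row']:.2f} nnz/row, "
        f"{s['csr_kib']:.0f} KiB in CSR, V100 speedup {s['speedup']:.2f}x"
    )
```

Both inputs come out at well under 2 MiB with roughly 6 to 8 nonzeros per row, and the V100 is less than 7% faster in either case. That is consistent with (though it does not prove) a workload too small or too serialized to benefit from the larger GPU.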

Is this performance reasonable for a V100 GPU?

Thanks!

Best regards, Jing

Have you looked at the two scenarios with the CUDA profiler? If so, any salient differences?

I have no insights into the specific functionality mentioned, but I am wondering whether this could be a case where memory latency is a major bottleneck.

Hi,

Thanks for your response.

Yes, we used nvprof to profile the solver. Most of the time is spent in the three csrqr kernels.

GPU activities:   81.69%  291.79ms         6  **48.631ms**  43.199ms  52.016ms  void csrqr_leftLooking_cta_byLevels_core<double, double, int=8>(int, int, double*, int const *, int const *, int const *, int const *, int const *, int const *, double*, double*, int*, int*, int*, int*, int*, int)
                   10.64%  38.018ms         6  **6.3364ms**  5.7612ms  6.9147ms  void csrqr_solve_Qtb_cta_core<double, int=8>(int, int, double const *, int const *, int const *, int const *, int const *, int const *, int const *, double*, int*, int*)
                    4.47%  15.973ms         6  **2.6622ms**  2.4035ms  2.9362ms  void csrqr_upper_direct_kernel<double, int=5, int=3>(int, double const *, int const *, int const *, int const *, double const *, double const *, double*, int*, int*)
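
As a quick sanity check on the figures above, the sketch below (Python, values copied by hand from the three nvprof rows) adds up the shares and recomputes the per-call averages. The "one launch of each kernel per solver call" reading in the last step is an assumption, not something nvprof reports.

```python
# (share %, total ms, calls, avg ms) copied from the nvprof rows above.
kernels = {
    "csrqr_leftLooking_cta_byLevels_core": (81.69, 291.79, 6, 48.631),
    "csrqr_solve_Qtb_cta_core": (10.64, 38.018, 6, 6.3364),
    "csrqr_upper_direct_kernel": (4.47, 15.973, 6, 2.6622),
}

# Share of all GPU activity covered by the three csrqr kernels.
combined_share = sum(share for share, _, _, _ in kernels.values())

# Recompute the average per launch from total time and call count.
per_call_ms = {
    name: total / calls for name, (_, total, calls, _) in kernels.items()
}

# ASSUMPTION: each solver call launches each of these kernels exactly once.
gpu_ms_per_solve = sum(per_call_ms.values())

print(f"combined share of GPU time: {combined_share:.2f}%")
for name, ms in per_call_ms.items():
    print(f"{name}: {ms:.3f} ms per launch")
print(f"csrqr GPU time per solve (assumed): {gpu_ms_per_solve:.1f} ms")
```

The three kernels together cover almost all of the GPU activity, and the factorization kernel alone accounts for more than four fifths of it, so that is the one worth looking at in detail.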

The execution times of the three kernels are similar on the V100 and the Quadro T2000 mobile. However, performance on a P100 is worse than on the V100, as expected. So this is probably not a memory bandwidth issue.

Thanks. /Jing

What I meant (I should have said it explicitly) are the various metrics that show how efficiently various GPU resources are used. The profiler can help you pinpoint the bottlenecks in the code. Is it limited by computation throughput of certain functional units? Is it limited by memory throughput? Etc.
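
To make that concrete, here is a rough sketch in Python of how one might read such metrics once the profiler has reported them as a percentage of peak. The function name, the 60% threshold and the sample readings are all assumptions for illustration (the threshold is a common rule of thumb, not a hard limit); none of the numbers are measurements from this thread.

```python
def likely_limiter(compute_pct, memory_pct, threshold=60.0):
    """Guess the bottleneck from percent-of-peak utilization figures.

    compute_pct: achieved compute throughput, percent of device peak.
    memory_pct:  achieved memory throughput, percent of device peak.
    threshold:   ASSUMED rule-of-thumb cutoff below which neither
                 resource is considered saturated.
    """
    if max(compute_pct, memory_pct) < threshold:
        # Neither unit is busy: stalls, low occupancy or too little work.
        return "latency or insufficient parallelism"
    if compute_pct >= memory_pct:
        return "compute throughput"
    return "memory throughput"


# Hypothetical readings, NOT taken from the kernels discussed here.
samples = {
    "kernel_a": (85.0, 30.0),
    "kernel_b": (20.0, 75.0),
    "kernel_c": (12.0, 18.0),
}

for name, (compute_pct, memory_pct) in samples.items():
    print(f"{name}: {likely_limiter(compute_pct, memory_pct)}")
```

A kernel that lands in the third category would be expected to run at about the same speed on a small and a large GPU, since the extra compute units and bandwidth of the larger device sit idle.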

Hi,

Thanks for your comments.

Which profiler and flags can be used to identify the limitations you mentioned? And once these limitations are identified, what tuning can we do through the library interface, e.g. cusolverSpDcsrlsvqr?

Thanks. /Jing