Hi,
Thanks for your response.
Yes, we used nvprof to profile that solver. Most of the time is spent in the three csrqr kernels:
GPU activities (columns: Time(%), Time, Calls, Avg, Min, Max, Name):
81.69% 291.79ms 6 **48.631ms** 43.199ms 52.016ms void csrqr_leftLooking_cta_byLevels_core<double, double, int=8>(int, int, double*, int const *, int const *, int const *, int const *, int const *, int const *, double*, double*, int*, int*, int*, int*, int*, int)
10.64% 38.018ms 6 **6.3364ms** 5.7612ms 6.9147ms void csrqr_solve_Qtb_cta_core<double, int=8>(int, int, double const *, int const *, int const *, int const *, int const *, int const *, int const *, double*, int*, int*)
4.47% 15.973ms 6 **2.6622ms** 2.4035ms 2.9362ms void csrqr_upper_direct_kernel<double, int=5, int=3>(int, double const *, int const *, int const *, int const *, double const *, double const *, double*, int*, int*)
The execution times of the three kernels are similar on the V100 and the Quadro T2000 Mobile, even though the V100 has far higher memory bandwidth. The P100, however, is slower than the V100, as expected. So this is probably not a memory bandwidth issue.
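As a quick sanity check on the profile figures, here is a small Python sketch that adds up the three csrqr kernels. The numbers are copied from the nvprof summary in this post. Treating the six calls as six separate solves is only an assumption for illustration, and the variable names are made up for this example.

```python
# Figures copied from the nvprof GPU-activities summary in this post.
# Each tuple: (kernel name, share of GPU time in %, total time in ms, calls)
kernels = [
    ("csrqr_leftLooking_cta_byLevels_core", 81.69, 291.79, 6),
    ("csrqr_solve_Qtb_cta_core", 10.64, 38.018, 6),
    ("csrqr_upper_direct_kernel", 4.47, 15.973, 6),
]

# Combined share and combined total time of the three kernels.
total_share = sum(share for _, share, _, _ in kernels)
total_ms = sum(ms for _, _, ms, _ in kernels)

# Assumption: each of the 6 calls belongs to one solve, so the sum of
# the per-call averages approximates the kernel time of a single solve.
per_call_ms = sum(ms / calls for _, _, ms, calls in kernels)

for name, share, ms, calls in kernels:
    print(f"{name}: {ms / calls:.3f} ms per call ({share:.2f}% of GPU time)")
print(f"combined: {total_share:.2f}% of GPU time, {total_ms:.3f} ms total")
print(f"combined per call: {per_call_ms:.3f} ms")
```

The three kernels together account for nearly all of the measured GPU time, and the left-looking factorization kernel dominates, so that is the one to compare across GPUs.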
Thanks. /Jing