Abnormally slow performance

Hello everyone,
I’m dealing with a project where I decided to do Cholesky factorization via cuSolver library. When I test with CUDA samples, a 1000x1000 matrix takes ~10ms. However, when I test it within my project, its performance degraded into ~100ms.
I attached related code as shown below:

check_cuda_errors(cudaMemset(d_info, 0, sizeof(int)));
check_cusolver_errors(cusolverDnDpotrf(cusolver_handle, cublas_fillmode, ld, d_covariance, ld, cusolver_buffer, cusolver_buffersize, d_info));
check_cuda_errors(cudaMemcpy(&h_info, d_info, sizeof(int), cudaMemcpyDeviceToHost));
if (h_info != 0)
    if (screen) fprintf(screen, "Cholesky Factorization failed. Aborted.\n");
    if (logfile) fprintf(logfile, "Cholesky Factorization failed. Aborted.\n");

I profiled my application (above code is repeated for 100 times) with nvprof and below is profiled result:

Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   64.12%  5.55337s       600  9.2556ms  4.4314ms  16.903ms  maxwell_dgemm_128x64_lower_nt
                   12.91%  1.11784s      4000  279.46us  89.377us  1.5853ms  void trsm_right_kernel<double, int=256, int=4, bool=0, bool=0, bool=1, bool=0, bool=0>(cublasTrsmParams<double>, double, double const *, int)
                   10.07%  871.77ms      1700  512.81us  78.080us  3.8615ms  maxwell_dgemm_64x64_lower_nt
                    5.47%  474.12ms      2000  237.06us  84.097us  1.3376ms  void magma_lds128_dgemm_kernel<bool=0, bool=1, int=5, int=5, int=3, int=3, int=3>(int, int, int, double const *, int, double const *, int, double*, int, int, int, double const *, double const *, double, double, int)
                    2.34%  202.37ms       300  674.56us  586.60us  2.3859ms  maxwell_dgemm_64x64_nt
                    1.30%  112.45ms     18800  5.9810us  3.4240us  536.36us  void potrf_alg1_cta_lower<double, double, int=16>(int, int, double*, int, int*)
                    1.03%  89.221ms     17700  5.0400us  3.5200us  168.71us  void potrf_alg1_trsm_lower<double, double, int=16>(int, int, int, int, int, double*, int)
                    1.01%  87.055ms     16400  5.3080us  2.4960us  177.09us  void potrf_alg1_syrk_kernel<double, bool=0, bool=1, int=5, int=4, int=4, int=5, int=4>(int, int, double const *, int, double*, int, bool)
                    0.79%  68.742ms       100  687.42us  676.71us  1.1876ms  void trmv_tile_kernel<double, int=64, int=128, int=4, bool=0, bool=0, bool=0>(cublasTrmvTileParams<double>)
                    0.53%  45.508ms       700  65.011us  33.185us  344.90us  void gemm_kernel2x2_core<double, bool=0, bool=0, bool=0, bool=0, bool=1>(double*, double const *, double const *, int, int, int, int, int, int, double*, double*, double, double, int)
                    0.37%  32.309ms       100  323.09us  310.37us  875.66us  void syhemvl_generic_off_diag<double, bool=0, int=32, int=32, int=4, int=4>(int, int, double, double*, double*, int, double*, int, double*, int, int)
                    0.03%  2.9532ms         1  2.9532ms  2.9532ms  2.9532ms  generate_seed_pseudo(__int64, __int64, __int64, curandOrdering, curandStateXORWOW*, unsigned int*)
                    0.02%  1.3760ms       100  13.759us  13.184us  15.712us  void syhemvl_generic_diag<double, bool=0, int=32, int=32, int=4, int=4>(int, int, double, double*, double*, int, double*, int, double, double*, double*, int, int)
                    0.01%  936.97us       100  9.3690us  8.8640us  10.016us  [CUDA memcpy DtoD]
                    0.01%  536.29us       100  5.3620us  4.7680us  5.9210us  void gen_sequenced<curandStateXORWOW, double2, normal_args_double_st, __operator_&__(double2 curand_normal_scaled2_double<curandStateXORWOW>(curandStateXORWOW*, normal_args_double_st))>(curandStateXORWOW*, double2*, unsigned long, unsigned long, normal_args_double_st)
                    0.00%  153.63us       100  1.5360us  1.4720us  1.6640us  potrf_alg1_reset_info(int*)
                    0.00%  139.65us       100  1.3960us  1.3440us  1.6960us  potrf_alg1_set_info(int, int, int*)
                    0.00%  72.929us       100     729ns     672ns     928ns  [CUDA memset]
                    0.00%  60.577us       100     605ns     576ns  1.3440us  [CUDA memcpy DtoH]
      API calls:   61.40%  13.4332s     62801  213.90us  3.1780us  174.01ms  cudaLaunchKernel
                   37.93%  8.29686s       100  82.969ms  81.321ms  94.789ms  cudaMemcpy
                    0.33%  72.576ms       101  718.58us  682.14us  3.0673ms  cudaFree
                    0.15%  33.376ms       100  333.76us  320.40us  885.59us  cudaDeviceSynchronize
                    0.11%  24.411ms     83402     292ns      72ns  4.9087ms  cudaGetLastError
                    0.03%  6.7764ms      4000  1.6940us  1.2180us  281.31us  cudaBindTexture
                    0.01%  3.0297ms       102  29.703us  20.799us  168.35us  cudaMalloc
                    0.01%  2.7548ms      4000     688ns     432ns  284.74us  cudaUnbindTexture
                    0.01%  1.9823ms       100  19.822us  12.013us  55.253us  cudaMemcpy2DAsync
                    0.01%  1.4287ms       100  14.287us  8.0820us  214.85us  cudaMemset
                    0.00%  3.9630us         1  3.9630us  3.9630us  3.9630us  cudaThreadSynchronize

It seems that cudaMemcpy() is much more slower than expected. It takes around ~ 80ms while cusolverDnDpotrf() takes around ~ 10ms (which is supposed to be most time consuming part).
I noted that cudaMemcpy() will block host execution and such delay may result from unfinished kernal execution. However, I didn’t observe such delay in CUDA samples of cholesky factorization.
I was wondering what could be possible reasons of such abnormally slow behavior. I’m new to CUDA so I might make some obvious mistakes. I appreciate for any possible helps and thanks in advance!

The above code doesn’t help guessing what could be the problem.
I suggest you run your program in a nvvp (Nvidia Visual Profiler) session and it will show you what are the bottle necks in each kernel.

Not necessarily. If you run the VectorAdd sample on a reasonably large dataset, like a 1GB or more, you will see on the profiler that the kernel processes the data faster than the system can copy it back and forth from host to device over PCIe. But again, it is not possible to guess anything else just with this information, so I suppose nvvp has the answer to your question.

Hi Saulocpp,
Thanks for your comments! I will try nvvp to check which part is the bottle neck. I didn’t use it before since nvvp doesn’t generate a graphic timeline for my program for some reasons… I will try to figure it out and might be able to provide more information.

(1) Whenever you profile, make sure to use a fully optimized release build. Accidentally using a debug build is a frequent source of unexpected slowdowns reported in these forums.

(2) Double check the performance of the PCIe link with the sample app bandwidthTest included with CUDA. I recently noticed that my PCIe link was slow because it was running as a x4 link. After unplugging and reinserting my GPU the link now operates at x16 as it should be. No idea what had happened before. For a PCIe gen 3 x16 link, bandwidthTest should report ~ 12GB/sec copy throughput from and to the GPU.

(3) As opposed to isolated kernels that run entirely on the GPU, application-level performance can be limited by serial portions of the code per Amdahl’s Law. The use of a CPU with high single-thread performance is recommended (> 3.5 GHz base clock for the CPU is ideal) to avoid being bottlenecked by serial code.