Hello everyone,

I’m dealing with a project where I decided to do Cholesky factorization via cuSolver library. When I test with CUDA samples, a 1000x1000 matrix takes ~10ms. However, when I test it within my project, its performance degraded into ~100ms.

I attached related code as shown below:

```
check_cuda_errors(cudaMemset(d_info, 0, sizeof(int)));
check_cusolver_errors(cusolverDnDpotrf(cusolver_handle, cublas_fillmode, ld, d_covariance, ld, cusolver_buffer, cusolver_buffersize, d_info));
check_cuda_errors(cudaMemcpy(&h_info, d_info, sizeof(int), cudaMemcpyDeviceToHost));
if (h_info != 0)
{
if (screen) fprintf(screen, "Cholesky Factorization failed. Aborted.\n");
if (logfile) fprintf(logfile, "Cholesky Factorization failed. Aborted.\n");
exit(EXIT_FAILURE);
}
```

I profiled my application (above code is repeated for 100 times) with nvprof and below is profiled result:

```
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 64.12% 5.55337s 600 9.2556ms 4.4314ms 16.903ms maxwell_dgemm_128x64_lower_nt
12.91% 1.11784s 4000 279.46us 89.377us 1.5853ms void trsm_right_kernel<double, int=256, int=4, bool=0, bool=0, bool=1, bool=0, bool=0>(cublasTrsmParams<double>, double, double const *, int)
10.07% 871.77ms 1700 512.81us 78.080us 3.8615ms maxwell_dgemm_64x64_lower_nt
5.47% 474.12ms 2000 237.06us 84.097us 1.3376ms void magma_lds128_dgemm_kernel<bool=0, bool=1, int=5, int=5, int=3, int=3, int=3>(int, int, int, double const *, int, double const *, int, double*, int, int, int, double const *, double const *, double, double, int)
2.34% 202.37ms 300 674.56us 586.60us 2.3859ms maxwell_dgemm_64x64_nt
1.30% 112.45ms 18800 5.9810us 3.4240us 536.36us void potrf_alg1_cta_lower<double, double, int=16>(int, int, double*, int, int*)
1.03% 89.221ms 17700 5.0400us 3.5200us 168.71us void potrf_alg1_trsm_lower<double, double, int=16>(int, int, int, int, int, double*, int)
1.01% 87.055ms 16400 5.3080us 2.4960us 177.09us void potrf_alg1_syrk_kernel<double, bool=0, bool=1, int=5, int=4, int=4, int=5, int=4>(int, int, double const *, int, double*, int, bool)
0.79% 68.742ms 100 687.42us 676.71us 1.1876ms void trmv_tile_kernel<double, int=64, int=128, int=4, bool=0, bool=0, bool=0>(cublasTrmvTileParams<double>)
0.53% 45.508ms 700 65.011us 33.185us 344.90us void gemm_kernel2x2_core<double, bool=0, bool=0, bool=0, bool=0, bool=1>(double*, double const *, double const *, int, int, int, int, int, int, double*, double*, double, double, int)
0.37% 32.309ms 100 323.09us 310.37us 875.66us void syhemvl_generic_off_diag<double, bool=0, int=32, int=32, int=4, int=4>(int, int, double, double*, double*, int, double*, int, double*, int, int)
0.03% 2.9532ms 1 2.9532ms 2.9532ms 2.9532ms generate_seed_pseudo(__int64, __int64, __int64, curandOrdering, curandStateXORWOW*, unsigned int*)
0.02% 1.3760ms 100 13.759us 13.184us 15.712us void syhemvl_generic_diag<double, bool=0, int=32, int=32, int=4, int=4>(int, int, double, double*, double*, int, double*, int, double, double*, double*, int, int)
0.01% 936.97us 100 9.3690us 8.8640us 10.016us [CUDA memcpy DtoD]
0.01% 536.29us 100 5.3620us 4.7680us 5.9210us void gen_sequenced<curandStateXORWOW, double2, normal_args_double_st, __operator_&__(double2 curand_normal_scaled2_double<curandStateXORWOW>(curandStateXORWOW*, normal_args_double_st))>(curandStateXORWOW*, double2*, unsigned long, unsigned long, normal_args_double_st)
0.00% 153.63us 100 1.5360us 1.4720us 1.6640us potrf_alg1_reset_info(int*)
0.00% 139.65us 100 1.3960us 1.3440us 1.6960us potrf_alg1_set_info(int, int, int*)
0.00% 72.929us 100 729ns 672ns 928ns [CUDA memset]
0.00% 60.577us 100 605ns 576ns 1.3440us [CUDA memcpy DtoH]
API calls: 61.40% 13.4332s 62801 213.90us 3.1780us 174.01ms cudaLaunchKernel
37.93% 8.29686s 100 82.969ms 81.321ms 94.789ms cudaMemcpy
0.33% 72.576ms 101 718.58us 682.14us 3.0673ms cudaFree
0.15% 33.376ms 100 333.76us 320.40us 885.59us cudaDeviceSynchronize
0.11% 24.411ms 83402 292ns 72ns 4.9087ms cudaGetLastError
0.03% 6.7764ms 4000 1.6940us 1.2180us 281.31us cudaBindTexture
0.01% 3.0297ms 102 29.703us 20.799us 168.35us cudaMalloc
0.01% 2.7548ms 4000 688ns 432ns 284.74us cudaUnbindTexture
0.01% 1.9823ms 100 19.822us 12.013us 55.253us cudaMemcpy2DAsync
0.01% 1.4287ms 100 14.287us 8.0820us 214.85us cudaMemset
0.00% 3.9630us 1 3.9630us 3.9630us 3.9630us cudaThreadSynchronize
```

It seems that cudaMemcpy() is much more slower than expected. It takes around ~ 80ms while cusolverDnDpotrf() takes around ~ 10ms (which is supposed to be most time consuming part).

I noted that cudaMemcpy() will block host execution and such delay may result from unfinished kernal execution. However, I didn’t observe such delay in CUDA samples of cholesky factorization.

I was wondering what could be possible reasons of such abnormally slow behavior. I’m new to CUDA so I might make some obvious mistakes. I appreciate for any possible helps and thanks in advance!