Asynchronous cuSolverDn functions


I’m trying to use some cuSolverDn functions (QR, SVD) but can’t seem to get them to actually run asynchronously. The basic sequence of operations is:

  1. cudaStreamCreate() + cusolverDnSetStream()
  2. allocate GPU memory (including workspace, parameters, etc.)
  3. start timer
  4. cudaMemcpyAsync( HtoD )
  5. cusolverDnDgeqrf() / cusolverDnDgesvdj()
  6. cudaMemcpyAsync( DtoH )
  7. cudaLaunchHostFunc()
  8. print timer
  9. cudaStreamSynchronize()
  10. print timer
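
For reference, the steps above can be sketched roughly as follows for the QR case. This is a minimal sketch, not the exact code behind the timings in this post: the matrix size, the fill pattern, the `done` callback and the `seconds_since` helper are all illustrative assumptions, error checking is mostly omitted, and the host buffer is assumed to be pinned via cudaMallocHost() (with pageable memory, cudaMemcpyAsync() may itself behave synchronously with respect to the host).

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
#include <cusolverDn.h>

// Hypothetical host callback: runs once all earlier work in the stream is done.
static void CUDART_CB done(void* msg) {
    std::printf("%s\n", static_cast<const char*>(msg));
}

// Hypothetical helper: wall-clock seconds elapsed since t0.
static double seconds_since(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    const int m = 4096, n = 4096, lda = m;  // illustrative problem size (assumption)
    const size_t count = size_t(lda) * n;
    const size_t bytes = sizeof(double) * count;

    // 1. stream + cuSolver handle bound to that stream
    cudaStream_t stream;
    cusolverDnHandle_t handle;
    cudaStreamCreate(&stream);
    cusolverDnCreate(&handle);
    cusolverDnSetStream(handle, stream);

    // 2. pinned host memory (assumption), device memory and workspace
    double* hA = nullptr;
    cudaMallocHost(&hA, bytes);
    for (size_t i = 0; i < count; ++i) hA[i] = double((i * 7 + 3) % 13);

    double *dA = nullptr, *dTau = nullptr, *dWork = nullptr;
    int* dInfo = nullptr;
    int lwork = 0;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dTau, sizeof(double) * n);
    cudaMalloc(&dInfo, sizeof(int));
    cusolverDnDgeqrf_bufferSize(handle, m, n, dA, lda, &lwork);
    cudaMalloc(&dWork, sizeof(double) * lwork);

    // 3. start timer
    const auto t0 = std::chrono::steady_clock::now();

    // 4. host-to-device copy
    cudaMemcpyAsync(dA, hA, bytes, cudaMemcpyHostToDevice, stream);

    // 5. QR factorization on the stream attached to the handle
    cusolverStatus_t st = cusolverDnDgeqrf(handle, m, n, dA, lda, dTau, dWork, lwork, dInfo);
    if (st != CUSOLVER_STATUS_SUCCESS) std::printf("geqrf returned status %d\n", int(st));

    // 6. device-to-host copy
    cudaMemcpyAsync(hA, dA, bytes, cudaMemcpyDeviceToHost, stream);

    // 7. host callback enqueued behind the copy
    cudaLaunchHostFunc(stream, done, const_cast<char*>("stream work finished"));

    // 8. time until all work has been enqueued
    std::printf("after enqueue: %.6f s\n", seconds_since(t0));

    // 9. wait for the stream to drain
    cudaStreamSynchronize(stream);

    // 10. time until all work has completed
    std::printf("after sync:    %.6f s\n", seconds_since(t0));

    cudaFree(dWork); cudaFree(dInfo); cudaFree(dTau); cudaFree(dA);
    cudaFreeHost(hA);
    cusolverDnDestroy(handle);
    cudaStreamDestroy(stream);
    return 0;
}
```

This should build with something like `nvcc repro.cu -lcusolver`. If the call in step 5 returns before the GPU work has finished, the time printed in step 8 should be much smaller than the one printed in step 10.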

All examples are tested on a TitanV GPU.

The times printed in steps 8 and 10 barely differ, indicating that the cuSolver functions run synchronously, e.g. 3.605s vs 3.637s for a particular problem size with SVD.

If I replace the cuSolver QR/SVD call in step 5 with cublasDgemm, I get 2.2e-04s (step 8) and 6.6e-01s (step 10), so this seems to run asynchronously.

Under what conditions do cuSolver functions really run asynchronously? Or is the phrase “prefer to keep asynchronous execution” from the docs an indication that many functions actually block?


Can you provide reproducer code? Some cuSolver functions are blocking if their implementation is more heterogeneous (using the CPU as well as the GPU, instead of running strictly on the GPU).

Also, you might try analyzing with Nsight Systems to get more insight.