How is it that some eigensolver functions (e.g. cuSolver's cusolverDnCheevj or cusolverDnSsyevd) perform a hidden device-to-host cudaMemcpy into pageable memory, which causes an implicit synchronization, while the batched version does not initiate such a copy? (The stream was set with cusolverDnSetStream.) In Nsight Systems I can see that when such a solver is used, the DMA (copy) stream does not run concurrently with the compute stream, which is a large inefficiency in terms of device occupancy. Where can I find any documentation of this behavior of the solvers?