Shocking cuSolver implicit synchronizations

How is it that some eigensolver functions (e.g. cuSOLVER's cusolverDnCheevj or cusolverDnSsyevd) perform a hidden device-to-host cudaMemcpy into pageable memory, which causes an implicit synchronization, while their batched versions do not initiate such a copy? (The stream has been set with cusolverDnSetStream.) In Nsight Systems I can see that, when such a solver is used, the DMA (copy) stream does not run concurrently with the compute stream, which is a big inefficiency in terms of device utilization. Where can I find any documentation about this behavior of the solvers?
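For reference, here is a minimal sketch of the kind of setup I mean. It is not my actual code. The matrix size, iteration count, transfer size, and all variable names are arbitrary assumptions chosen for illustration. It issues cusolverDnSsyevd on one stream and an independent pinned host-to-device copy on another, so the timeline can be inspected in Nsight Systems to see whether the two overlap.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>
#include <cusolverDn.h>

#define CHECK_CUDA(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "CUDA error %s at line %d\n", cudaGetErrorString(e_), __LINE__); exit(1); } } while (0)
#define CHECK_CUSOLVER(call) do { cusolverStatus_t s_ = (call); if (s_ != CUSOLVER_STATUS_SUCCESS) { \
    fprintf(stderr, "cuSOLVER error %d at line %d\n", (int)s_, __LINE__); exit(1); } } while (0)

int main() {
    const int n = 256;                 // assumed matrix dimension
    const int iterations = 20;         // assumed number of solver calls
    const size_t copyBytes = 64 << 20; // assumed size of the unrelated transfer (64 MiB)
    const size_t matBytes = sizeof(float) * n * n;

    // Two non-blocking streams: one for the solver, one for the independent copy.
    cudaStream_t computeStream, copyStream;
    CHECK_CUDA(cudaStreamCreateWithFlags(&computeStream, cudaStreamNonBlocking));
    CHECK_CUDA(cudaStreamCreateWithFlags(&copyStream, cudaStreamNonBlocking));

    cusolverDnHandle_t handle;
    CHECK_CUSOLVER(cusolverDnCreate(&handle));
    CHECK_CUSOLVER(cusolverDnSetStream(handle, computeStream));

    // Build a symmetric, column-major test matrix on the host.
    std::vector<float> hA(static_cast<size_t>(n) * n);
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i)
            hA[i + static_cast<size_t>(j) * n] = (i == j) ? float(n) : 1.0f / float(1 + i + j);

    float *dA0, *dA, *dW, *dWork;
    int *dInfo;
    CHECK_CUDA(cudaMalloc(&dA0, matBytes)); // pristine copy of the input
    CHECK_CUDA(cudaMalloc(&dA, matBytes));  // overwritten with eigenvectors by syevd
    CHECK_CUDA(cudaMalloc(&dW, sizeof(float) * n));
    CHECK_CUDA(cudaMalloc(&dInfo, sizeof(int)));
    CHECK_CUDA(cudaMemcpy(dA0, hA.data(), matBytes, cudaMemcpyHostToDevice));

    int lwork = 0;
    CHECK_CUSOLVER(cusolverDnSsyevd_bufferSize(handle, CUSOLVER_EIG_MODE_VECTOR,
                                               CUBLAS_FILL_MODE_LOWER, n, dA, n, dW, &lwork));
    CHECK_CUDA(cudaMalloc(&dWork, sizeof(float) * lwork));

    // Pinned host buffer so the copy on copyStream is truly asynchronous.
    char *hPinned, *dBuf;
    CHECK_CUDA(cudaMallocHost(&hPinned, copyBytes));
    CHECK_CUDA(cudaMalloc(&dBuf, copyBytes));
    memset(hPinned, 0, copyBytes);

    for (int it = 0; it < iterations; ++it) {
        // Restore the input, because syevd overwrites A.
        CHECK_CUDA(cudaMemcpyAsync(dA, dA0, matBytes, cudaMemcpyDeviceToDevice, computeStream));
        CHECK_CUSOLVER(cusolverDnSsyevd(handle, CUSOLVER_EIG_MODE_VECTOR, CUBLAS_FILL_MODE_LOWER,
                                        n, dA, n, dW, dWork, lwork, dInfo));
        // Independent transfer that should be able to overlap with the solver.
        CHECK_CUDA(cudaMemcpyAsync(dBuf, hPinned, copyBytes, cudaMemcpyHostToDevice, copyStream));
    }
    CHECK_CUDA(cudaDeviceSynchronize());

    int info = -1;
    CHECK_CUDA(cudaMemcpy(&info, dInfo, sizeof(int), cudaMemcpyDeviceToHost));
    printf("devInfo of last call = %d\n", info);

    CHECK_CUDA(cudaFreeHost(hPinned));
    CHECK_CUDA(cudaFree(dBuf));
    CHECK_CUDA(cudaFree(dWork));
    CHECK_CUDA(cudaFree(dInfo));
    CHECK_CUDA(cudaFree(dW));
    CHECK_CUDA(cudaFree(dA));
    CHECK_CUDA(cudaFree(dA0));
    CHECK_CUSOLVER(cusolverDnDestroy(handle));
    CHECK_CUDA(cudaStreamDestroy(copyStream));
    CHECK_CUDA(cudaStreamDestroy(computeStream));
    return 0;
}
```

Assuming the file is saved as repro.cu (a hypothetical name), it can be built with `nvcc repro.cu -lcusolver` and profiled with `nsys profile ./a.out`. Any memcpy that shows up inside the solver call in the timeline was issued by the library, since the sketch itself issues none on the compute stream other than the device-to-device restore.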