The syevjBatched functions from the cuSOLVER API appear to synchronize.
Therefore, I cannot overlap their execution with other work using CUDA streams.
With a batch size of 101, I can see that 101 * 4 bytes (in single precision) are transferred to the host.
This host memory is not pinned, and the copy therefore synchronizes.
syevjBatched likely uses a DtoH transfer because the eigenvalue algorithm has a stopping criterion that is evaluated on the host periodically.
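For reference, here is a minimal sketch of the kind of setup I mean. The matrix size, the matrix contents, and the `busyKernel` stand-in for "other work" are made up for illustration; only the batch size of 101 matches the numbers above. The solver is bound to its own non-blocking stream, and an unrelated kernel is launched on a second stream right after the solver call. If the call were asynchronous, the two should be able to overlap in the Nsight Systems timeline.

```cuda
// Sketch only. Assumed build line: nvcc overlap_sketch.cu -lcusolver -o overlap_sketch
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <cusolverDn.h>

#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t e_ = (call);                                            \
        if (e_ != cudaSuccess) {                                            \
            std::fprintf(stderr, "CUDA error '%s' at line %d\n",            \
                         cudaGetErrorString(e_), __LINE__);                 \
            std::exit(EXIT_FAILURE);                                        \
        }                                                                   \
    } while (0)

#define CUSOLVER_CHECK(call)                                                \
    do {                                                                    \
        cusolverStatus_t s_ = (call);                                       \
        if (s_ != CUSOLVER_STATUS_SUCCESS) {                                \
            std::fprintf(stderr, "cuSOLVER error %d at line %d\n",          \
                         static_cast<int>(s_), __LINE__);                   \
            std::exit(EXIT_FAILURE);                                        \
        }                                                                   \
    } while (0)

// Hypothetical stand-in for the "other work" that should overlap with the solver.
__global__ void busyKernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        for (int k = 0; k < 10000; ++k) v = v * 1.0001f + 0.5f;
        x[i] = v;
    }
}

int main() {
    const int n = 32, lda = n, batchSize = 101;  // n is illustrative
    const int otherCount = 1 << 20;

    // Symmetric, diagonally dominant matrices in column-major order (made-up data).
    std::vector<float> hA(static_cast<size_t>(lda) * n * batchSize);
    for (int b = 0; b < batchSize; ++b)
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i)
                hA[static_cast<size_t>(b) * lda * n + static_cast<size_t>(j) * lda + i] =
                    (i == j) ? static_cast<float>(n + b) : 1.0f / static_cast<float>(1 + i + j);

    float *dA = nullptr, *dW = nullptr, *dWork = nullptr, *dOther = nullptr;
    int* dInfo = nullptr;
    CUDA_CHECK(cudaMalloc(&dA, sizeof(float) * hA.size()));
    CUDA_CHECK(cudaMalloc(&dW, sizeof(float) * n * batchSize));
    CUDA_CHECK(cudaMalloc(&dInfo, sizeof(int) * batchSize));
    CUDA_CHECK(cudaMalloc(&dOther, sizeof(float) * otherCount));
    CUDA_CHECK(cudaMemcpy(dA, hA.data(), sizeof(float) * hA.size(), cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemset(dOther, 0, sizeof(float) * otherCount));

    cudaStream_t solverStream, otherStream;
    CUDA_CHECK(cudaStreamCreateWithFlags(&solverStream, cudaStreamNonBlocking));
    CUDA_CHECK(cudaStreamCreateWithFlags(&otherStream, cudaStreamNonBlocking));

    cusolverDnHandle_t handle;
    syevjInfo_t params;
    CUSOLVER_CHECK(cusolverDnCreate(&handle));
    CUSOLVER_CHECK(cusolverDnSetStream(handle, solverStream));  // solver work goes to solverStream
    CUSOLVER_CHECK(cusolverDnCreateSyevjInfo(&params));

    int lwork = 0;
    CUSOLVER_CHECK(cusolverDnSsyevjBatched_bufferSize(
        handle, CUSOLVER_EIG_MODE_VECTOR, CUBLAS_FILL_MODE_LOWER,
        n, dA, lda, dW, &lwork, params, batchSize));
    CUDA_CHECK(cudaMalloc(&dWork, sizeof(float) * lwork));

    // Enqueue the batched solver first, then the unrelated kernel on the other stream.
    CUSOLVER_CHECK(cusolverDnSsyevjBatched(
        handle, CUSOLVER_EIG_MODE_VECTOR, CUBLAS_FILL_MODE_LOWER,
        n, dA, lda, dW, dWork, lwork, dInfo, params, batchSize));
    busyKernel<<<(otherCount + 255) / 256, 256, 0, otherStream>>>(dOther, otherCount);
    CUDA_CHECK(cudaGetLastError());

    CUDA_CHECK(cudaDeviceSynchronize());
    std::printf("done; inspect the timeline in Nsight Systems\n");

    CUSOLVER_CHECK(cusolverDnDestroySyevjInfo(params));
    CUSOLVER_CHECK(cusolverDnDestroy(handle));
    CUDA_CHECK(cudaStreamDestroy(solverStream));
    CUDA_CHECK(cudaStreamDestroy(otherStream));
    CUDA_CHECK(cudaFree(dA));
    CUDA_CHECK(cudaFree(dW));
    CUDA_CHECK(cudaFree(dWork));
    CUDA_CHECK(cudaFree(dInfo));
    CUDA_CHECK(cudaFree(dOther));
    return 0;
}
```

The solver call is issued before `busyKernel` on purpose: if the call blocks the host until its internal copy finishes, the second launch cannot even be submitted until the solver has returned.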
Is there some way to get around this issue?
Perhaps the cuSOLVER documentation should state which functions cannot run asynchronously.
Useful links:
syevjBatched API
syevjBatched example (run this code with Nsight Systems to reproduce the issue)
related question (unanswered)