It seems that all host functions launched with cudaLaunchHostFunc, even on different streams, are executed sequentially on a single thread.
I can't find any runtime API to configure CUDA to use a thread pool for them.
Is there any way to do this?
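
A minimal sketch of how the observation could be checked (this is an illustration, not taken from any official sample). The stream count, the 200 ms sleep, and the helper name `record_thread` are all assumptions chosen for the example. Each callback records the ID of the thread it runs on and then sleeps, so the number of distinct thread IDs and the total elapsed time indicate whether the callbacks ran serially or overlapped. Results may vary with CUDA version, driver, and platform.

```cuda
#include <cuda_runtime.h>

#include <chrono>
#include <cstdio>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

static std::mutex g_mu;
static std::set<std::thread::id> g_ids;  // threads that ran a callback

// Hypothetical host callback: records the calling thread, then sleeps so
// that any overlap between callbacks would show up in the elapsed time.
// Note: host functions must not call CUDA APIs, and this one does not.
static void CUDART_CB record_thread(void* /*userData*/) {
    {
        std::lock_guard<std::mutex> lock(g_mu);
        g_ids.insert(std::this_thread::get_id());
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
}

int main() {
    const int kStreams = 4;  // assumed value for illustration
    std::vector<cudaStream_t> streams(kStreams);
    for (auto& s : streams) {
        if (cudaStreamCreate(&s) != cudaSuccess) {
            std::fprintf(stderr, "cudaStreamCreate failed\n");
            return 1;
        }
    }

    auto t0 = std::chrono::steady_clock::now();

    // One host function per stream; the streams have no dependencies on
    // each other, so nothing in the program forces an ordering.
    for (auto& s : streams) {
        if (cudaLaunchHostFunc(s, record_thread, nullptr) != cudaSuccess) {
            std::fprintf(stderr, "cudaLaunchHostFunc failed\n");
            return 1;
        }
    }
    cudaDeviceSynchronize();  // wait for all callbacks to finish

    auto t1 = std::chrono::steady_clock::now();
    long long ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();

    std::printf("distinct callback threads: %zu, elapsed: %lld ms\n",
                g_ids.size(), ms);

    for (auto& s : streams) cudaStreamDestroy(s);
    return 0;
}
```

If the callbacks are serialized on one thread, the elapsed time should be roughly the stream count times the sleep duration, with a single distinct thread ID reported; overlapping execution would show a shorter time and more than one ID.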