It seems that all host functions launched with cudaLaunchHostFunc, even on different streams, are executed sequentially on a single thread.
I can't find any runtime API to configure CUDA to use a thread pool.
Is there any way to do this?
Hello @SparkHu, any luck with this?
A related thread: cudaLaunchHostFunc API example
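For what it's worth, one possible workaround for the single-thread behavior described in the question is to keep the host function itself tiny and have it hand the real work to a thread pool you own. Below is a minimal sketch of that idea, not an official CUDA mechanism. `HostTaskPool`, `HostWork`, and `enqueueHostWork` are hypothetical names made up for illustration; the only CUDA-specific assumption is that `enqueueHostWork` has the `void (*)(void*)` shape that `cudaLaunchHostFunc` expects for its host function argument. The CUDA call appears only in a comment so the block compiles as plain C++.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper, not part of CUDA: a minimal fixed-size worker pool.
class HostTaskPool {
public:
    explicit HostTaskPool(std::size_t numThreads) {
        for (std::size_t i = 0; i < numThreads; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }

    // Drains any queued tasks, then joins all workers.
    ~HostTaskPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }

    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void workerLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to run
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so workers execute in parallel
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    bool stopping_ = false;
};

// Hypothetical payload passed through the void* userData argument.
struct HostWork {
    HostTaskPool* pool;
    std::function<void()> work;
};

// Has the void(void*) shape of a CUDA host function. It only enqueues the
// work and returns immediately, so the thread calling it is not held up.
void enqueueHostWork(void* userData) {
    assert(userData != nullptr);
    auto* item = static_cast<HostWork*>(userData);
    item->pool->submit(std::move(item->work));
    delete item;
}

// Intended use with CUDA (assumes a valid cudaStream_t named stream):
//   HostTaskPool pool(4);
//   cudaLaunchHostFunc(stream, enqueueHostWork,
//                      new HostWork{&pool, [] { /* heavy CPU work */ }});
```

One caveat with this approach: the stream only waits for `enqueueHostWork` to return, not for the pooled task to finish. If later work in the stream depends on the result, you would need your own synchronization on top of this.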