I tried to build a more efficient framework for launching GPU work than my previous synchronous implementation (copy memory to the device, call the kernel, copy memory back from the device). To that end I started using streams so I could make asynchronous calls. I wanted a separate thread to spin and check for completion, so I tried using cudaStreamQuery to check whether all the calls had completed, but from a separate thread it always returns cudaErrorInvalidResourceHandle when I pass it the stream handle.
If I call cudaStreamQuery from the same thread directly after queuing all the other calls, it works fine, but if I block everything else and query it from a different thread, it fails. It throws a first-chance exception when called from another thread, and when I cast the error memory address to an int I see it’s a cudaErrorInitializationError. I don’t understand why this happens. Are there rules about which CPU thread may make these calls? Any help you can provide is much appreciated. If you can suggest an alternative pattern for optimal performance, I’m also open to suggestions.
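
For reference, here is a minimal sketch of the pattern described above: async copy in, kernel, async copy out on one stream, then a second CPU thread polling cudaStreamQuery. It is not my actual code. The names `scale`, `pollStream`, and `runAndPoll` are placeholders, and the cudaSetDevice call inside the polling thread is only my guess at what that thread might need.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstddef>
#include <thread>

// Placeholder kernel: doubles every element in place.
__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Runs on the polling thread. Spins on cudaStreamQuery until the stream
// is idle or reports an error, and returns the last status it saw.
cudaError_t pollStream(cudaStream_t stream, int device) {
    // Assumption: the polling thread selects the same device as the
    // thread that created the stream before touching the handle.
    cudaError_t status = cudaSetDevice(device);
    if (status != cudaSuccess) return status;
    while ((status = cudaStreamQuery(stream)) == cudaErrorNotReady) {
        std::this_thread::sleep_for(std::chrono::microseconds(50));
    }
    return status;
}

// Queues copy-in, kernel, and copy-out on one stream, then lets a second
// thread wait for completion. Returns true if every step succeeded and
// the results came back doubled.
bool runAndPoll(int n) {
    const int device = 0;
    if (cudaSetDevice(device) != cudaSuccess) return false;

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* host = nullptr;
    float* dev = nullptr;
    cudaStream_t stream;

    // Pinned host memory, so the async copies can overlap with host work.
    if (cudaMallocHost(reinterpret_cast<void**>(&host), bytes) != cudaSuccess) return false;
    if (cudaMalloc(reinterpret_cast<void**>(&dev), bytes) != cudaSuccess) {
        cudaFreeHost(host);
        return false;
    }
    if (cudaStreamCreate(&stream) != cudaSuccess) {
        cudaFree(dev);
        cudaFreeHost(host);
        return false;
    }

    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    // All three operations are queued on the same stream and return
    // to the caller without waiting for the GPU.
    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(dev, n);
    cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, stream);

    // Separate CPU thread that checks for completion.
    cudaError_t status = cudaErrorUnknown;
    std::thread poller([&status, stream, device] {
        status = pollStream(stream, device);
    });
    poller.join();

    bool ok = (status == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (host[i] == 2.0f);

    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(host);
    return ok;
}
```

Calling `runAndPoll(1 << 16)` from `main` exercises the whole path. In my real code the polling thread is long-lived rather than joined immediately, but the cudaStreamQuery call is made the same way.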