I have an application that I wrote using CUDA 3.2 to perform some calculations. I did NOT use any asynchronous API calls or anything of the sort in that CUDA app, just plain memcpy and sequential kernel launches.
To achieve better performance, in my host code I have spawned multiple pthreads and created a new context in each of those worker threads. The problem is that each context takes up a lot of memory, so I have been investigating using streams instead.
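To make the current setup concrete, here is roughly what each worker thread does today (just a sketch; the function name is made up, and I'm using the driver API here only to make the per-thread context creation explicit):

```
#include <cuda.h>
#include <pthread.h>

/* Sketch of my current approach: every worker thread creates its OWN
 * context, which is what eats all the memory.  cuInit(0) is assumed to
 * have been called once in main() before the threads are spawned. */
void *worker(void *arg)
{
    CUdevice  dev;
    CUcontext ctx;

    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);     /* per-thread context */

    /* ... allocate buffers and run my (fully synchronous) CUDA class ... */

    cuCtxDestroy(ctx);
    return NULL;
}
```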
What is the best strategy here? If I go back through my CUDA code and simply add a stream parameter to all of the CUDA memory operations and kernel launches, so that the class instance containing my CUDA code operates entirely on the SAME stream, will it basically still run sequentially, the way I have it written now?
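To show what I mean by "adding a stream parameter", here is a before/after sketch (the kernel and buffer names are made up, and I realize the host buffers would have to be pinned with cudaHostAlloc for the async copies to truly overlap anything):

```
/* before -- what the class does now, everything on the default stream: */
cudaMemcpy(d_in, h_in, nbytes, cudaMemcpyHostToDevice);
myKernel<<<grid, block>>>(d_in, d_out, n);
cudaMemcpy(h_out, d_out, nbytes, cudaMemcpyDeviceToHost);

/* after -- everything issued on the one stream handed to the class: */
cudaMemcpyAsync(d_in, h_in, nbytes, cudaMemcpyHostToDevice, stream);
myKernel<<<grid, block, 0, stream>>>(d_in, d_out, n);
cudaMemcpyAsync(h_out, d_out, nbytes, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);   /* only where I actually need the results */
```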
Would it be OK for my worker threads to all push the same CUDA context? If they all pushed the same context and each passed a different stream to the CUDA class I have modified, would that give me concurrency among the worker threads?
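In other words, something like this per worker thread (again just a sketch with made-up names; whether it is even legal for every thread to have the same context pushed at the same time is exactly what I'm unsure about):

```
/* shared_ctx would be created once in main() and handed to every thread */
void *worker(void *arg)
{
    CUcontext shared_ctx = *(CUcontext *)arg;
    CUcontext popped;

    cuCtxPushCurrent(shared_ctx);        /* make the shared context current here */

    cudaStream_t stream;
    cudaStreamCreate(&stream);           /* one stream per worker thread */

    /* ... pass `stream` to my CUDA class, which now issues all of its
     *     memcpys and kernel launches on that stream ... */

    cudaStreamDestroy(stream);
    cuCtxPopCurrent(&popped);
    return NULL;
}
```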
Thanks for any tips or resources that you can point me to.
The examples in the SDK seem to be lacking details on some of this stuff.