I am trying to implement per-thread default stream concurrency in my application. For starters, I tried just compiling and running Mark Harris’ multi-threading example. I tried both the #define CUDA_API_PER_THREAD_DEFAULT_STREAM option and the nvcc compile option --default-stream per-thread. I built it both in Visual Studio 2017 (v15.4) and with nvcc from the command line, and I tried both a Windows port of pthreads and C++11 std::thread.
However, when I profile the program with nvvp, I do not see any concurrency between the kernels. “cudaStreamSynchronize(0)” seems to synchronize all default streams instead of just the calling host thread’s per-thread default stream.
I tried this on a Windows 7 machine with a Quadro P2000 card as well as on a Windows 10 machine with a Quadro M2200 card. Has anyone else encountered these issues and found a fix? What is the proper way to implement per-thread default streams on Windows?
Mark Harris’ tutorial: https://devblogs.nvidia.com/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
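
For reference, here is a minimal sketch of the kind of test described above, modeled on the tutorial's multi-threaded example but using std::thread. It is not the tutorial's exact code: the helper names launch_kernel and run_threads, the file name in the build comment, and the thread count are assumptions made for illustration. It passes the cudaStreamPerThread handle explicitly rather than relying on stream 0 being remapped, which may help separate a build-flag problem from a runtime one. Note that the tutorial warns that the #define has no effect inside a .cu file compiled by nvcc, because nvcc implicitly includes cuda_runtime.h at the top of the translation unit before the #define is seen; in that case the macro has to come from the command line (-DCUDA_API_PER_THREAD_DEFAULT_STREAM) or from --default-stream per-thread.

```cuda
// Assumed build line (file name is hypothetical):
//   nvcc --default-stream per-thread -std=c++11 pts_test.cu -o pts_test
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Same style of busy-work kernel as in the tutorial.
__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrtf(powf(3.14159f, (float)i));
    }
}

// Hypothetical helper: runs one kernel on the calling host thread's
// per-thread default stream. Returns true if the allocation, the launch
// and the synchronization all reported success.
bool launch_kernel()
{
    const int N = 1 << 20;
    float *data = nullptr;
    if (cudaMalloc(&data, N * sizeof(float)) != cudaSuccess) return false;

    // Explicit per-thread stream handle instead of stream 0.
    kernel<<<1, 64, 0, cudaStreamPerThread>>>(data, N);
    cudaError_t launch_err = cudaGetLastError();

    // Waits only for work issued to this host thread's stream.
    cudaError_t sync_err = cudaStreamSynchronize(cudaStreamPerThread);

    cudaFree(data);
    return launch_err == cudaSuccess && sync_err == cudaSuccess;
}

// Hypothetical helper: starts num_threads host threads, each launching one
// kernel, and returns how many of them finished without a CUDA error.
int run_threads(int num_threads)
{
    std::vector<char> ok(num_threads, 0);   // one slot per thread, no sharing
    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i) {
        threads.emplace_back([&ok, i] { ok[i] = launch_kernel() ? 1 : 0; });
    }
    for (auto &t : threads) t.join();

    int succeeded = 0;
    for (char c : ok) succeeded += c;
    return succeeded;
}

int main()
{
    const int num_threads = 8;
    int succeeded = run_threads(num_threads);
    std::printf("%d of %d threads completed without CUDA errors\n",
                succeeded, num_threads);
    cudaDeviceReset();
    return succeeded == num_threads ? 0 : 1;
}
```

If the kernels overlap in nvvp with the explicit cudaStreamPerThread handle but not with stream 0, that would suggest the per-thread option is not reaching the compiler for that translation unit; if they still do not overlap, the cause is presumably elsewhere.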