I’ve read Mark Harris’s article “CUDA 7 Streams Simplify Concurrency” and found it really useful, so I wanted to run the sample from the article to see the real performance. My device is a GTX 970, and I use Visual Studio 2013 to compile and test the sample. But I’m disappointed to find that no concurrency occurs.
I’ve added --default-stream per-thread in “CUDA C/C++ -> Command Line”, but nothing changed.
I’ve also defined CUDA_API_PER_THREAD_DEFAULT_STREAM in “CUDA C/C++ -> Host -> Preprocessor Definitions” and written the #define on the first line of the source code, but still nothing changed.
Finally, I replaced the cudaStreamCreate function with cudaStreamCreateWithFlags, passing cudaStreamNonBlocking as the second parameter, but there was still no concurrency at all!
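For reference, here is a minimal sketch of the kind of test described above, loosely following the article's sample. The stream count, problem size, launch configuration and the CHECK helper are illustrative assumptions, not the exact code from the article or from my project.

```cuda
// Assumption: this define only helps if it precedes every CUDA header.
// As far as I understand, nvcc includes cuda_runtime.h implicitly in .cu
// files, so the --default-stream per-thread flag is the safer route there.
#define CUDA_API_PER_THREAD_DEFAULT_STREAM
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

const int N = 1 << 20;       // illustrative problem size
const int kNumStreams = 8;   // illustrative stream count

// Deliberately slow kernel so any overlap shows up on a profiler timeline.
__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrtf(powf(3.14159f, (float)i));
    }
}

int main()
{
    cudaStream_t streams[kNumStreams];
    float *data[kNumStreams];

    for (int i = 0; i < kNumStreams; ++i) {
        // Non-blocking streams do not synchronize with the legacy default stream.
        CHECK(cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking));
        CHECK(cudaMalloc(&data[i], N * sizeof(float)));

        // A single small block per launch leaves the other SMs free,
        // so kernels in different streams have room to overlap.
        kernel<<<1, 64, 0, streams[i]>>>(data[i], N);
        CHECK(cudaGetLastError());
    }

    CHECK(cudaDeviceSynchronize());

    for (int i = 0; i < kNumStreams; ++i) {
        CHECK(cudaFree(data[i]));
        CHECK(cudaStreamDestroy(streams[i]));
    }

    // Resetting the device lets profilers flush their trace data before exit.
    CHECK(cudaDeviceReset());
    return 0;
}
```

I check for overlap by looking at the kernel rows on the profiler timeline. With this sketch I would expect the eight launches to overlap, since each one uses only a single block.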
Does anyone know how to solve this problem?