Concurrency with the default stream

I’ve read Mark Harris’s article “CUDA 7 Streams Simplify Concurrency” and found it really useful, so I wanted to run the sample from the article to see the real performance. My device is a GTX 970 and I use Visual Studio 2013 to compile and test the sample. But I was disappointed to find that no concurrency happened.
I added --default-stream per-thread under “CUDA C/C++ → Command Line”, but nothing changed.
I also defined CUDA_API_PER_THREAD_DEFAULT_STREAM under “CUDA C/C++ → Host → Preprocessor Definitions” and wrote the macro on the first line of the source code, but still nothing changed.
Finally, I replaced cudaStreamCreate with cudaStreamCreateWithFlags, passing cudaStreamNonBlocking as the second parameter, but there was still no concurrency at all!
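
For context, here is a minimal sketch of the non-blocking stream variant described above. The kernel name `work`, the launch shape, the file name, and the build command in the comment are assumptions for illustration, not the article's code. One detail worth knowing: when compiling with nvcc, a `#define CUDA_API_PER_THREAD_DEFAULT_STREAM` written in the source file has no effect, because nvcc implicitly includes cuda_runtime.h at the top of the translation unit. The macro has to come from the command line (-DCUDA_API_PER_THREAD_DEFAULT_STREAM), or be replaced by --default-stream per-thread.

```cuda
// Assumed build command (file name is hypothetical):
//   nvcc --default-stream per-thread streams.cu -o streams
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: writes 2*i into each element.
__global__ void work(float *x, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) x[i] = 2.0f * i;
}

int main()
{
    const int n = 1024;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    // A non-blocking stream does not synchronize implicitly with the
    // legacy default stream.
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

    work<<<(n + 255) / 256, 256, 0, s>>>(d, n);

    // Wait for the stream and report any launch or execution error.
    cudaError_t err = cudaStreamSynchronize(s);
    printf("status: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(s);
    cudaFree(d);
    return 0;
}
```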

Does anyone know how to solve this problem?

I got it. cudaStreamCreate and cudaMalloc should not be called inside the kernel launch loop, because these functions need synchronization.
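
A rough sketch of that restructuring, loosely modeled on the multi-stream sample from the article: all streams and buffers are created first, and the launch loop contains nothing but kernel launches. The kernel body, the sizes, and the name kNumStreams are assumptions for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

const int N = 1 << 20;       // elements per buffer (assumed size)
const int kNumStreams = 8;   // number of streams (assumed)

// Illustrative worker kernel: a grid-stride loop with some arithmetic.
__global__ void kernel(float *x, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        x[i] = sqrtf(powf(3.14159f, (float)i));
    }
}

int main()
{
    cudaStream_t streams[kNumStreams];
    float *data[kNumStreams];

    // Phase 1: create streams and allocate buffers before any launch,
    // so no setup call sits between consecutive kernel launches.
    for (int i = 0; i < kNumStreams; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&data[i], N * sizeof(float));
    }

    // Phase 2: the launch loop contains only launches, one per stream.
    for (int i = 0; i < kNumStreams; ++i) {
        kernel<<<1, 64, 0, streams[i]>>>(data[i], N);
    }

    // Phase 3: wait for all work, then release resources outside the loop.
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    for (int i = 0; i < kNumStreams; ++i) {
        cudaFree(data[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```

Whether the kernels actually overlap is easiest to confirm in a profiler timeline rather than from timings alone.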

cudaFree() is also a synchronous function. That being said, I didn’t modify any of the source code and I still got overlapping execution. But doing it your way, I got significantly improved overlapped execution.

Thanks for your advice. I guess something differs in how the compiler optimizes the code.