Hi. I’d really appreciate an answer to my question. I have built a CUDA kernel that processes data stored in a shared array. I normally obtain correct processing results when I run the kernel in the default stream, but if I run it like this:
my code loses synchronization, and I’m stuck with no idea how to properly synchronize it when it runs in a stream other than the default one. The chunk of code above is invoked from another kernel, since I’m using dynamic parallelism.
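To be concrete, the pattern I’m describing looks roughly like this (a simplified sketch, not my actual code; kernel names and launch sizes are placeholders):

#include <cuda_runtime.h>

// Placeholder child kernel, just to make the sketch complete.
__global__ void childKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Parent kernel: one thread creates a stream and launches the child into it.
// Built with relocatable device code for dynamic parallelism,
// e.g. nvcc -arch=sm_35 -rdc=true -lcudadevrt
__global__ void parentKernel(float *data, int n)
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        cudaStream_t s;
        // The device runtime only accepts streams created with this flag.
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

        childKernel<<<(n + 255) / 256, 256, 0, s>>>(data, n);

        cudaStreamDestroy(s);
    }
    // Nothing here waits for the child: its results are only guaranteed
    // to be visible after the parent grid itself has completed.
}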
Can you tell me, in general, why this kind of code produces incorrect processing results when it is run in a previously created stream?
Note that, by default, the default stream is treated differently from any other stream. This behaviour can be changed by compiling with the “--default-stream per-thread” compilation flag, or by defining CUDA_API_PER_THREAD_DEFAULT_STREAM before including any CUDA headers (which is tricky when compiling with nvcc, because nvcc automatically includes the CUDA headers at the top of every CUDA source file).
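For example, assuming a plain command-line build (file names are just placeholders; in an IDE the flag goes into the project’s CUDA compiler options):

# Enable the per-thread default stream for everything nvcc compiles:
nvcc --default-stream per-thread -o app app.cu

# For host-only sources compiled by the host compiler, the macro route works,
# provided it is defined before any CUDA header is included:
g++ -DCUDA_API_PER_THREAD_DEFAULT_STREAM -c host_part.cpp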
Thanks for your reply, and I’ve got a couple more questions: how do I compile with the --default-stream per-thread flag in VS2015? And what is the difference between synchronization in the default stream and in a stream created with the cudaStreamCreateWithFlags routine?
Thanks for the reply. I have this problem because I need streams: in my code I invoke some kernels recursively, and that doesn’t work for me unless I use streams. Do you perhaps know how to invoke kernels recursively without using streams?
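To show what I mean by recursive invocation, the pattern is roughly this (again a simplified sketch with placeholder names, not my real code):

// Hypothetical recursive kernel: each level launches the next one into its own
// non-blocking stream (the only kind the device runtime allows).
__global__ void recurse(float *data, int n, int depth, int maxDepth)
{
    // ... process data for this recursion level ...

    if (depth + 1 < maxDepth && threadIdx.x == 0 && blockIdx.x == 0) {
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        recurse<<<1, 256, 0, s>>>(data, n, depth + 1, maxDepth);
        cudaStreamDestroy(s);
    }
    // Note: the device runtime limits the nesting depth and the number of
    // pending child launches, so deep recursion needs care.
}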
And one more question: to get closer to the problem I’m still encountering, could I send you my entire code for analysis without publishing it on the forum?