Why does my kernel code lose synchronization when run in a stream other than the default?

Hi. I’d very much appreciate an answer to my question. I have a CUDA kernel that processes data stored in a shared array. I normally get correct processing results when I run the kernel in the default stream, but if I launch it like this:

cudaStream_t s;
cudaErrorCheck(cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking));
cdpMyKernel<InpType, CntType><<<1, nChunkSize, 0, s>>>(
    ppdChunkPtr, ppdCountPtr[threadIdx.x], nChunkSize);

, my code loses synchronization, and I’m stuck with no idea how to synchronize it properly when it runs in a stream other than the default one. The snippet above is invoked from another kernel, since I’m using dynamic parallelism.

Can you tell me, in general, why this code produces incorrect processing results when run in a previously created stream?

Thanks. Waiting for your reply.

It is impossible to tell from the code snippet alone, but most likely you are relying on the implicit synchronization of the default stream somewhere.

Note that by default, the default stream is treated differently from any other stream. This behavior can be changed by compiling with the --default-stream per-thread flag, or by defining CUDA_API_PER_THREAD_DEFAULT_STREAM before any CUDA header is included (which is tricky when compiling with nvcc, as the CUDA headers are automatically included at the top of every CUDA source file).
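In case it helps, here is a minimal sketch of the two options (the file name my_app.cu and the kernel are hypothetical):

```cuda
// Option 1: build with the nvcc flag:
//   nvcc --default-stream per-thread my_app.cu -o my_app
// Option 2: define the macro on the command line instead (with nvcc it
// must be on the command line, since the CUDA headers are force-included
// before any #define in your source file takes effect):
//   nvcc -DCUDA_API_PER_THREAD_DEFAULT_STREAM my_app.cu -o my_app

__global__ void emptyKernel() {}  // placeholder kernel

int main() {
    // With per-thread default streams, this launch no longer
    // synchronizes implicitly with work that other host threads
    // issue on their own default streams.
    emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```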

Thanks for your reply. I have a couple more questions: how do I compile with the --default-stream per-thread flag in VS2015? And what is the difference between synchronization in the default stream and in a stream created with the cudaStreamCreateWithFlags routine?

And one more question: is it possible to send you a fragment of my code without publishing it in this forum thread?

The special role of the default stream is explained in the CUDA Programming Guide, just above the section I linked to in my previous post. Essentially, it automatically synchronizes with any other stream / CUDA operation.
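To make that concrete, here is a host-side sketch of the difference (work is a placeholder kernel; illustrative only):

```cuda
__global__ void work(float *p) { /* ... */ }  // placeholder kernel

int main() {
    float *d;
    cudaMalloc(&d, 256 * sizeof(float));

    cudaStream_t blocking, nonblocking;
    cudaStreamCreate(&blocking);  // a "blocking" stream
    cudaStreamCreateWithFlags(&nonblocking, cudaStreamNonBlocking);

    work<<<1, 256, 0, blocking>>>(d);
    // A launch in the legacy default stream implicitly waits for the
    // prior launch in 'blocking', and subsequent launches in 'blocking'
    // wait for it in turn:
    work<<<1, 256>>>(d);
    // A stream created with cudaStreamNonBlocking does NOT synchronize
    // with the legacy default stream, so this launch may overlap the
    // previous one; ordering has to be established explicitly, e.g.:
    work<<<1, 256, 0, nonblocking>>>(d);
    cudaStreamSynchronize(nonblocking);  // or cudaEvent_t + cudaStreamWaitEvent

    cudaStreamDestroy(blocking);
    cudaStreamDestroy(nonblocking);
    cudaFree(d);
    return 0;
}
```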

I have no experience with VS2015 myself, so unfortunately I can’t help you there, but I’m sure others on this forum can.

Thanks for the reply. I have this problem because I need streams: in my code I invoke some kernels recursively, which doesn’t work unless I use streams. Do you perhaps know how to invoke kernels recursively without using streams?

Invoking kernels recursively requires Dynamic Parallelism. Can you elaborate on why this only works for your problem in combination with streams?

Yes, sure: if I don’t use streams when launching kernels recursively, I get error 4 on the kernel launch.
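To illustrate, this is roughly the pattern that works for me, stripped down (template parameters and error checks dropped; only cdpMyKernel and nChunkSize come from my real code, the rest is illustrative):

```cuda
// Built with relocatable device code for dynamic parallelism, e.g.:
//   nvcc -arch=sm_35 -rdc=true app.cu -lcudadevrt

__global__ void cdpMyKernel(float *pdChunk, int nChunkSize, int nDepth) {
    // ... process this chunk ...

    if (nDepth > 0 && threadIdx.x == 0) {
        // Device-side streams must be created with cudaStreamNonBlocking;
        // they never synchronize implicitly with any other stream.
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        cdpMyKernel<<<1, nChunkSize, 0, s>>>(pdChunk, nChunkSize, nDepth - 1);
        cudaStreamDestroy(s);  // the already-enqueued child keeps running
    }
}
```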

And one more question: to get closer to the problem I’m still encountering, could I submit my entire code to you for analysis without publishing it on the forum?

If you wish to hire a consultant, I would suggest posting in the sub-forum “GPU Computing Jobs” next door.