Why does my kernel code lose synchronization when run in a stream other than the default?

Invader0x7F · November 14, 2016, 1:42pm

Hi. I’d really appreciate an answer to my question. I have written a CUDA kernel that processes data stored in a shared array. I normally get correct results when I run the kernel in the default stream, but if I run it like this:

	cudaStream_t s;
	cudaErrorCheck(cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking));
	cdpMyKernel<InpType, CntType><<<1, nChunkSize, 0, s>>>(ppdChunkPtr, ppdCountPtr[threadIdx.x], nChunkSize);
	cudaErrorCheck(cudaStreamDestroy(s));

then the execution loses synchronization. I’m stuck and have no idea how to synchronize it properly so that it runs in a stream other than the default stream. The snippet above is invoked from another kernel, since I’m using dynamic parallelism.

Can you tell me, in general, why this code produces incorrect results when it runs in a previously created stream?

Thanks. Waiting for your reply.

tera · November 14, 2016, 2:09pm

It is impossible to tell just from the code snippet itself, but most likely you are relying on the implicit synchronization of the default stream somewhere.

Note that by default the default stream is treated differently from any other stream. This can be changed by compiling with the “--default-stream per-thread” compilation flag, or by defining CUDA_API_PER_THREAD_DEFAULT_STREAM before including any CUDA headers. The latter is tricky when compiling with nvcc, because the CUDA headers are automatically included at the top of any CUDA code file.
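For illustration, here is a minimal host-side sketch of the per-thread default stream. It assumes the special handle `cudaStreamPerThread`, which selects the per-thread default stream explicitly even when the file is not compiled with `--default-stream per-thread` (the handle `cudaStreamLegacy` selects the legacy default stream in the same way). The kernel `addOne` and the helper `runOnPerThreadStream` are hypothetical names made up for this example, and the sketch only covers host-side launches, not the device-side launches from the original question.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1 to every element.
__global__ void addOne(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical helper: copies data in, runs the kernel and copies data out,
// all in the per-thread default stream. Returns true if every element is 42.
bool runOnPerThreadStream(int n)
{
    std::vector<int> host(n, 41);
    const size_t bytes = n * sizeof(int);
    int *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;

    // Explicit handle for the per-thread default stream.
    cudaStream_t s = cudaStreamPerThread;
    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;

    // Operations issued into the same stream execute in issue order.
    bool ok = cudaMemcpyAsync(dev, host.data(), bytes, cudaMemcpyHostToDevice, s) == cudaSuccess;
    addOne<<<blocks, threads, 0, s>>>(dev, n);
    ok = ok && cudaGetLastError() == cudaSuccess;
    ok = ok && cudaMemcpyAsync(host.data(), dev, bytes, cudaMemcpyDeviceToHost, s) == cudaSuccess;

    // Nothing waits for this stream implicitly, so synchronize it explicitly
    // before reading the results on the host.
    ok = ok && cudaStreamSynchronize(s) == cudaSuccess;
    cudaFree(dev);

    for (int i = 0; ok && i < n; ++i) ok = (host[i] == 42);
    return ok;
}
```

Calling `runOnPerThreadStream(1000)` from `main` should return true on a working device. The point of the sketch is the explicit `cudaStreamSynchronize` call: once work leaves the legacy default stream, ordering against other streams has to be requested explicitly.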

Invader0x7F · November 14, 2016, 2:15pm

Thanks for your reply. I have a couple more questions: how do I compile with the --default-stream per-thread compilation flag in VSS2015? And what is the difference between synchronization in the default stream and in a stream created with the cudaStreamCreateWithFlags routine?

Invader0x7F · November 14, 2016, 2:28pm

And one more question: is it possible to send you a fragment of the code without publishing it in this forum thread?

tera · November 14, 2016, 2:29pm

The special role of the default stream is explained in the CUDA Programming Guide, just above the section I’ve linked to in my previous post. Essentially, it automatically synchronises with any other stream / CUDA operation.
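As a rough host-side sketch of what replaces that automatic synchronisation: streams created with `cudaStreamNonBlocking` are exempt from the implicit synchronisation with the legacy default stream, so any ordering between them has to be stated explicitly, for example with an event. The kernels `produce` and `consume` and the helper `runOrderedPipeline` are hypothetical names for this example only, and `n` is assumed to be positive.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical producer: writes buf[i] = i.
__global__ void produce(int *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = i;
}

// Hypothetical consumer: reads the producer's output and doubles it.
__global__ void consume(const int *buf, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2 * buf[i];
}

// Runs producer and consumer in two non-blocking streams, ordered by an event.
// Returns true if out[i] == 2 * i for all i.
bool runOrderedPipeline(int n)
{
    const size_t bytes = n * sizeof(int);
    int *dBuf = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dBuf, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dBuf); return false; }

    cudaStream_t sProd, sCons;
    cudaEvent_t produced;
    cudaStreamCreateWithFlags(&sProd, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&sCons, cudaStreamNonBlocking);
    cudaEventCreateWithFlags(&produced, cudaEventDisableTiming);

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;

    produce<<<blocks, threads, 0, sProd>>>(dBuf, n);
    // Without the next two calls the consumer could start before the
    // producer has finished, because the two streams are independent.
    cudaEventRecord(produced, sProd);
    cudaStreamWaitEvent(sCons, produced, 0);
    consume<<<blocks, threads, 0, sCons>>>(dBuf, dOut, n);

    // Wait for the consumer explicitly before copying the result back.
    bool ok = cudaStreamSynchronize(sCons) == cudaSuccess;
    std::vector<int> host(n, -1);
    ok = ok && cudaMemcpy(host.data(), dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaEventDestroy(produced);
    cudaStreamDestroy(sProd);
    cudaStreamDestroy(sCons);
    cudaFree(dBuf);
    cudaFree(dOut);

    for (int i = 0; ok && i < n; ++i) ok = (host[i] == 2 * i);
    return ok;
}
```

If both kernels were launched into the legacy default stream instead, the event would be unnecessary, since work within one stream already executes in order. That is the kind of hidden dependency that can break when code is moved to a separately created stream.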

I have no experience with VSS2015 myself, so unfortunately I can’t help you there. But I’m sure others in this forum can.

Invader0x7F · November 14, 2016, 2:39pm

Thanks for the reply. I have this problem because I need to use streams: my code invokes some kernels recursively, and that doesn’t work for me unless I use streams. Do you perhaps know how to invoke kernels recursively without using streams?

tera · November 14, 2016, 3:16pm

Invoking kernels recursively requires Dynamic Parallelism. Can you elaborate on why this only works for your problem when you use streams?
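To make the discussion concrete, here is a minimal Dynamic Parallelism sketch, not the original poster's code. It assumes a build with relocatable device code (something like `nvcc -rdc=true ... -lcudadevrt`) on a device that supports Dynamic Parallelism. The names `childFill`, `parentLaunch` and `runParentChild` are hypothetical. One possible cause of the behaviour described in this thread is that child launches from threads of the same block into the device-side NULL stream execute in order, while launches into separately created streams may run concurrently. The sketch therefore gives every child a disjoint slice of the output so that no child depends on a sibling.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical child kernel: fills one slice of the output with a tag value.
__global__ void childFill(int *slice, int tag, int n)
{
    int i = threadIdx.x;
    if (i < n) slice[i] = tag;
}

// Parent kernel: every parent thread launches one child grid in its own stream.
__global__ void parentLaunch(int *out, int nPerChild)
{
    cudaStream_t s;
    // On the device, streams have to be created with cudaStreamNonBlocking.
    if (cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking) != cudaSuccess) return;

    // Children in different streams may run concurrently and in any order,
    // so each child writes only to its own slice.
    childFill<<<1, nPerChild, 0, s>>>(out + threadIdx.x * nPerChild, threadIdx.x, nPerChild);

    // Destroying the stream does not cancel the child; the stream's resources
    // are released once the work already launched into it has finished.
    cudaStreamDestroy(s);
}

// Hypothetical host driver: returns true if every slice holds the index of
// the parent thread that launched its child.
bool runParentChild(int nParents, int nPerChild)
{
    const int total = nParents * nPerChild;
    const size_t bytes = total * sizeof(int);
    int *dOut = nullptr;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) return false;
    cudaMemset(dOut, 0xFF, bytes);   // mark every element as "not written"

    parentLaunch<<<1, nParents>>>(dOut, nPerChild);
    // A parent grid is not complete until all of its children are complete,
    // so one host-side synchronization covers parent and children.
    bool ok = cudaDeviceSynchronize() == cudaSuccess;

    std::vector<int> host(total, -1);
    ok = ok && cudaMemcpy(host.data(), dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dOut);

    for (int i = 0; ok && i < total; ++i) ok = (host[i] == i / nPerChild);
    return ok;
}
```

Calling `runParentChild(4, 8)` from `main` should return true. The sketch deliberately avoids reading child results inside the parent kernel, because that would need additional device-side synchronization whose availability depends on the toolkit version.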

Invader0x7F · November 14, 2016, 3:20pm

Yes, sure. If I don’t use streams when running kernels recursively, I get error 4 when launching the kernel.
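A bare number like this is hard to interpret, and as far as I know the numeric values of `cudaError_t` have not stayed the same across toolkit versions, so printing the error name is more reliable. A minimal sketch, assuming hypothetical helpers `describeCudaError` and `launchAndReport` (neither is part of my code or of the CUDA API):

```cuda
#include <string>
#include <cuda_runtime.h>

// Hypothetical helper: turns a runtime error code into "name: description".
std::string describeCudaError(cudaError_t err)
{
    return std::string(cudaGetErrorName(err)) + ": " + cudaGetErrorString(err);
}

__global__ void emptyKernel() {}

// Hypothetical helper: launches a kernel and reports the first error seen,
// either at launch time or while the kernel executes.
std::string launchAndReport(int threadsPerBlock)
{
    emptyKernel<<<1, threadsPerBlock>>>();
    // Launch-time errors (for example an invalid configuration) show up here.
    cudaError_t launchErr = cudaGetLastError();
    // Errors raised while the kernel runs show up at the next synchronization.
    cudaError_t syncErr = cudaDeviceSynchronize();
    return describeCudaError(launchErr != cudaSuccess ? launchErr : syncErr);
}
```

For example, `launchAndReport(32)` should report `cudaSuccess`, while an oversized block such as `launchAndReport(1 << 20)` should report `cudaErrorInvalidConfiguration`. Posting the name instead of the number makes it much easier for others to help.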

Invader0x7F · November 14, 2016, 3:21pm

And one more question: to get closer to the problem I’m still facing, can I send you my entire code for analysis without publishing it on the forum?

njuffa · November 14, 2016, 4:39pm

If you wish to hire a consultant, I would suggest posting in the sub-forum “GPU Computing Jobs” next door.
