Kernels launched by multiple host threads get serialized by cudaStreamSynchronize(0) when --default-...

Hi All,
I’ve created a minimalistic sample program that captures the strange behavior of CUDA runtime API.

When compiled using “–default-stream=per-thread”, each host thread should be able to launch its own kernel and wait for result using cudaStreamSynchronize(0), in parallel.

But it turns out to be not true …

In the example code, when you set PARALLEL=false, you can observe perfect concurrency in GPU using visual profiler, as expected

When you set PARALLEL=true, you can observe complete serialization in GPU; when replacing cudaStreamSynchronize(0) by cudaMemcpy(outputh, outputd, sizeof(float), cudaMemcpyDeviceToHost) (and you need to uncomment the cudaMalloc and comment one cudaHostGetDevicePointer), concurrency increased somewhat, but still far from perfect concurrency.

The kernel is producing nothing interesting, so its logic can be ignored :-)
I’m using a single 1080 Ti, CUDA 10.1, Windows 7, Visual Studio Community 2017.

Any thoughts? Thanks for your time!

#include <iostream>
#include <cfloat>
#include <cmath>
#include <thread>
#include <atomic>
#include <vector>
#include <chrono>
#include <cuda_profiler_api.h>

const bool PARALLEL = true;

const int ArrayLength = 128;
const int LoopLength = 10240000;
const int ThreadsCount = 28;

std::atomic<bool> ready(false);

__global__ void kernel_inference(const float* __restrict__ input, float* __restrict__ output)
	for(int x=threadIdx.x;x<ArrayLength;x+=blockDim.x)
		float tmp = input[x];
		for(int i=0;i<LoopLength;i++)
			tmp = sinf(tmp*tmp+tmp+1.0f);
		atomicAdd(output, tmp);

struct Portal
	cudaStream_t stream;
	float* inputh;
	float* inputd;
	float* outputh;
	float* outputd;

		// malloc
		cudaHostAlloc(&inputh, ArrayLength*sizeof(float), cudaHostAllocMapped);
		cudaHostAlloc(&outputh, sizeof(float), cudaHostAllocMapped);
		//cudaMalloc(&outputd, sizeof(float));
		// get pointer
		cudaHostGetDevicePointer(&inputd, inputh, 0);
		cudaHostGetDevicePointer(&outputd, outputh, 0);
		// stream
		if (!PARALLEL) cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

		if (!PARALLEL) cudaStreamDestroy(stream);

	float query(float v)
		for(int i=0;i<ArrayLength;i++)
			inputh[i] = v+i;
		*outputh = 0;
		kernel_inference<<<1,128,0,PARALLEL?0:stream>>>(inputd, outputd);
		//if (PARALLEL) cudaMemcpy(outputh, outputd, sizeof(float), cudaMemcpyDeviceToHost);
		if (PARALLEL) cudaStreamSynchronize(0);
		return PARALLEL ? *outputh : 0;

void thread_main(int t) 
	Portal portal;

	while (!ready)


int main(int argc, char** argv)
	std::vector<std::unique_ptr<std::thread>> threads;
	std::vector<std::unique_ptr<Portal>> portals;

		for(int t=0;t<ThreadsCount;t++)
			threads.push_back(std::make_unique<std::thread>(thread_main, t));
		for(int t=0;t<ThreadsCount;t++)


	ready = true;

		for(int t=0;t<threads.size();t++)
		for(int t=0;t<portals.size();t++)

		for(int t=0;t<portals.size();t++)

compilation command

CALL "C:\Program Files (x86)\Microsoft Visual Studio017\Community\VC\Auxiliary\Build\vcvars64.bat"

nvcc --gpu-architecture=sm_61 --default-stream=per-thread --optimize=3 -Xcompiler "/wd4819" --x=cu  test.cpp --use_fast_math --library=cuda,cudart_static --library-path="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\lib\x64"  --output-file test.exe


Is there any notable difference using cudaMemcpyAsync() instead of cudaMemcpy() (ideally on page locked memory) ?

Well the purpose of cudaStreamSynchronize/cudaMemcpy here is synchronization to get kernel result. I’ve tried events & stream callback, still get extreme serialization. cudaMemcpy produces the best concurrency among these choices in the multithreading setup.

As I just mentioned on your stack overflow posting, I see the behavior you expect on linux. Concurrent kernel launches with PARALLEL=true and --default-stream=per-thread I acknowledge you see a difference in behavior on windows (WDDM). I generally advise against trying to achieve much in the way of concurrency on windows WDDM, because the WDDM model gets in the way.

I haven’t tried it on windows.

If the windows behavior is a concern and you don’t like using the method you already found that works, then my suggestion would be to file a bug at There is a sticky post at the top of this forum that details the procedure.

@Robert_Crovella. Thanks! I saw your stack overflow reply already. Sorry for the double posting ~

My scenario is a parallel tree search problem so it is not convenient nor efficient to post queries to a central server thread. I guess i can only hope for a future version of CUDA or WDDM to work around this problem, TCC cards are too expensive for me.

Btw, I already filed a bug to nvidia :-).