Kernels launched by multiple host threads get serialized by cudaStreamSynchronize(0) when --default-

albertazzf · March 6, 2019, 4:13pm

Hi All,
I’ve created a minimalistic sample program that captures the strange behavior of CUDA runtime API.

When compiled using “–default-stream=per-thread”, each host thread should be able to launch its own kernel and wait for result using cudaStreamSynchronize(0), in parallel.

But it turns out to be not true …

In the example code, when you set PARALLEL=false, you can observe perfect concurrency in GPU using visual profiler, as expected

When you set PARALLEL=true, you can observe complete serialization in GPU; when replacing cudaStreamSynchronize(0) by cudaMemcpy(outputh, outputd, sizeof(float), cudaMemcpyDeviceToHost) (and you need to uncomment the cudaMalloc and comment one cudaHostGetDevicePointer), concurrency increased somewhat, but still far from perfect concurrency.

The kernel is producing nothing interesting, so its logic can be ignored :-)
I’m using a single 1080 Ti, CUDA 10.1, Windows 7, Visual Studio Community 2017.

Any thoughts? Thanks for your time!

#include <iostream>
#include <cfloat>
#include <cmath>
#include <thread>
#include <atomic>
#include <vector>
#include <chrono>
#include <cuda_profiler_api.h>

const bool PARALLEL = true;

const int ArrayLength = 128;
const int LoopLength = 10240000;
const int ThreadsCount = 28;

std::atomic<bool> ready(false);

__global__ void kernel_inference(const float* __restrict__ input, float* __restrict__ output)
{
	for(int x=threadIdx.x;x<ArrayLength;x+=blockDim.x)
	{
		float tmp = input[x];
		for(int i=0;i<LoopLength;i++)
		{
			tmp = sinf(tmp*tmp+tmp+1.0f);
		}
		atomicAdd(output, tmp);
	}
}

struct Portal
{
	cudaStream_t stream;
	float* inputh;
	float* inputd;
	float* outputh;
	float* outputd;

	Portal()
	{
		// malloc
		cudaHostAlloc(&inputh, ArrayLength*sizeof(float), cudaHostAllocMapped);
		cudaHostAlloc(&outputh, sizeof(float), cudaHostAllocMapped);
		//cudaMalloc(&outputd, sizeof(float));
		// get pointer
		cudaHostGetDevicePointer(&inputd, inputh, 0);
		cudaHostGetDevicePointer(&outputd, outputh, 0);
		// stream
		if (!PARALLEL) cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
	}

	~Portal()
	{
		cudaFreeHost(inputh);
		cudaFreeHost(outputh);
		//cudaFree(outputd);
		if (!PARALLEL) cudaStreamDestroy(stream);
	}

	float query(float v)
	{
		for(int i=0;i<ArrayLength;i++)
			inputh[i] = v+i;
		*outputh = 0;
		kernel_inference<<<1,128,0,PARALLEL?0:stream>>>(inputd, outputd);
		//if (PARALLEL) cudaMemcpy(outputh, outputd, sizeof(float), cudaMemcpyDeviceToHost);
		if (PARALLEL) cudaStreamSynchronize(0);
		return PARALLEL ? *outputh : 0;
	}
};

void thread_main(int t) 
{
	Portal portal;

	while (!ready)
	{
		std::this_thread::sleep_for(std::chrono::milliseconds(1));
	}

	std::cout<<portal.query(t)<<std::endl;
}

int main(int argc, char** argv)
{
	std::vector<std::unique_ptr<std::thread>> threads;
	std::vector<std::unique_ptr<Portal>> portals;

	if (PARALLEL)
	{
		for(int t=0;t<ThreadsCount;t++)
			threads.push_back(std::make_unique<std::thread>(thread_main, t));
	}
	else
	{
		for(int t=0;t<ThreadsCount;t++)
			portals.push_back(std::make_unique<Portal>());
	}

	cudaDeviceSynchronize();

	ready = true;

	if (PARALLEL)
	{
		for(int t=0;t<threads.size();t++)
			threads[t]->join();
	}
	else
	{
		for(int t=0;t<portals.size();t++)
			portals[t]->query(t);

		for(int t=0;t<portals.size();t++)
		{
			cudaStreamSynchronize(portals[t]->stream);
			std::cout<<*(portals[t]->outputh)<<std::endl;
		}
	}
}

compilation command

CALL "C:\Program Files (x86)\Microsoft Visual Studio017\Community\VC\Auxiliary\Build\vcvars64.bat"

nvcc --gpu-architecture=sm_61 --default-stream=per-thread --optimize=3 -Xcompiler "/wd4819" --x=cu  test.cpp --use_fast_math --library=cuda,cudart_static --library-path="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\lib\x64"  --output-file test.exe

cbuchner1 · March 6, 2019, 9:36pm

Is there any notable difference using cudaMemcpyAsync() instead of cudaMemcpy() (ideally on page locked memory) ?

albertazzf · March 7, 2019, 2:20am

Well the purpose of cudaStreamSynchronize/cudaMemcpy here is synchronization to get kernel result. I’ve tried events & stream callback, still get extreme serialization. cudaMemcpy produces the best concurrency among these choices in the multithreading setup.

Robert_Crovella · March 8, 2019, 6:36pm

As I just mentioned on your stack overflow posting, I see the behavior you expect on linux. Concurrent kernel launches with PARALLEL=true and --default-stream=per-thread I acknowledge you see a difference in behavior on windows (WDDM). I generally advise against trying to achieve much in the way of concurrency on windows WDDM, because the WDDM model gets in the way.

I haven’t tried it on windows.

If the windows behavior is a concern and you don’t like using the method you already found that works, then my suggestion would be to file a bug at developer.nvidia.com. There is a sticky post at the top of this forum that details the procedure.

albertazzf · March 8, 2019, 6:46pm

@Robert_Crovella. Thanks! I saw your stack overflow reply already. Sorry for the double posting ~

My scenario is a parallel tree search problem so it is not convenient nor efficient to post queries to a central server thread. I guess i can only hope for a future version of CUDA or WDDM to work around this problem, TCC cards are too expensive for me.

Btw, I already filed a bug to nvidia :-).

chao.jia · June 11, 2021, 9:37pm

I profiled the multi-threading example (with per-thread default stream) on this page: GPU Pro Tip: CUDA 7 Streams Simplify Concurrency on Windows 10 version 21H1, RTX 2070, cuda 11.2, VS 2019 (16.7). The problem of kernels getting serialized can be solved by enabling Hardware-accelerated GPU scheduling.

albertazzf · June 13, 2021, 2:26pm

Great news, thanks for the information!
I’m still on Win7, will give Win10 a try later.

Topic		Replies	Views
Cannot force kernels to concurrent execution CUDA Programming and Performance	8	5548	April 28, 2012
My streams are not running concurrently CUDA Programming and Performance	7	1775	March 6, 2018
cudaMemcpyAsync CUDA Programming and Performance	10	20735	October 16, 2015
Cannot achieve kernel concurrency CUDA Programming and Performance	1	493	May 20, 2015
Cannot get any stream parallelism. CUDA Programming and Performance	13	1280	December 31, 2019
multi task parallelization with cuda streams ? CUDA Programming and Performance	7	1457	September 14, 2017
Streams concurrency bad performance CUDA Programming and Performance	3	2029	June 13, 2012
GPU Pro Tip: CUDA 7 Streams Simplify Concurrency Technical Blog	51	2108	February 5, 2020
4.0 RC - many host threads per one GPU - cudaStreamQuery and cudaStreamSynchronize behaviour. CUDA Programming and Performance	5	13389	March 10, 2011
Problem regarding data transfer overlap between multiple asynchronous streams CUDA Programming and Performance	8	799	September 11, 2016

Kernels launched by multiple host threads get serialized by cudaStreamSynchronize(0) when --default-

compilation command

Related topics