Per-thread Default Stream Concurrency

v.nguyen · February 8, 2018, 11:50pm

Hi,

I am trying to implement per-thread default stream concurrency in my application. For starters, I tried just compiling and running Mark Harris’ multi-threading example. I tried both the #define option and the compile --default-stream -per-thread option. I tried this with both vs2017 v15.4 and with nvcc from the command line. I tried both with a Windows version of pthreads and C++11 std::threads.

However, I do not see concurrency using nvvp to profile the program. “cudaStreamSynchronize(0)” seems to synchronize all default streams instead of just the specific host thread’s per-thread default stream.

I tried on a Windows 7 machine with a Quadro P2000 card as well as a Windows 10 machine with an M2200 card. Has anyone else encountered these issues and have found fixes? What is the proper way to implement per-thread default streams on Windows?

Mark Harris’ tutorial: [url]https://devblogs.nvidia.com/gpu-pro-tip-cuda-7-streams-simplify-concurrency/[/url]

Robert_Crovella · February 9, 2018, 12:01am

WDDM command batching might be getting in your way on Windows. I generally wouldn’t try to architect complex concurrency scenarios in a WDDM regime. It can be done, but it’s generally more difficult and often requires unusual tricks. The TCC driver model was created to allow developers to avoid the WDDM issues on windows. You might try re-running your experiements in TCC mode on one of your GPUs. It may necessitate putting an additional GPU in the system, since TCC mode GPUs cannot host a display.

Or you could try it on linux.

v.nguyen · February 10, 2018, 1:07am

txbob,

Thanks for your suggestion. I am having some IT issues with multiple GPUs installed, so I have not yet tried switching to TCC mode. When I am able to test TCC mode, I will post an update.

While IT and I work those hardware issues out, I wanted to try classic concurrency by explicitly using non-default streams. Even then, I get the behavior outlined in the previous post. To me, this does not seem like a per-thread default stream issue, but a core problem with stream management.

Does this seem to be the case, in your experience?

Thank you.

My relevant snippet of code:

const int N = 1 << 20;

__global__ void kernel(float *x, int n)
{
	int tid = threadIdx.x + blockIdx.x * blockDim.x;
	for (int i = tid; i < n; i += blockDim.x * gridDim.x) 
        {
		x[i] = sqrt(pow(3.14159, i));
	}
}

void launch_kernel(void *dev, void *host)
{
	cudaStream_t stream; 
	cudaStreamCreate(&stream);

	kernel << <1, 64, 0, stream>> >((float *)dev, N);

	cudaStreamSynchronize(stream);

	return;
}

int main()
{
	const int num_threads = 8;

	std::thread threads[num_threads];

	float *host[num_threads];
	float *dev[num_threads];

	for (int i = 0; i < num_threads; ++i)
	{
		cudaMalloc((void **)&dev[i], N * sizeof(float));

		cudaMallocHost((void **)&host[i], N * sizeof(float));
		
		memset(host[i], 0, N * sizeof(float));
	}

	cudaStream_t streams[num_threads];
	for (int i = 0; i < num_threads; ++i)
	{
		threads[i] = std::thread(&launch_kernel, dev[i], host[i]);
	}


	for (int i = 0; i < num_threads; ++i)
	{
		if (threads[i].joinable())
			threads[i].join();
	}

	cudaDeviceReset();

	return 0;
}

Topic		Replies	Views
Cuda nvcc default stream per-thread doesn't seem to be working CUDA Programming and Performance	0	805	August 10, 2020
GPU Pro Tip: CUDA 7 Streams Simplify Concurrency Technical Blog	51	3014	February 5, 2020
Concurrency about default stream CUDA Programming and Performance	3	2865	March 23, 2015
Kernels launched by multiple host threads get serialized by cudaStreamSynchronize(0) when --default- CUDA Programming and Performance	6	3085	June 13, 2021
--default-stream per-thread question CUDA Programming and Performance	2	839	August 22, 2018
Multi threaded issue with --default-stream per-thread CUDA Programming and Performance	3	1041	November 20, 2018
Default Stream per Thread - Driver API CUDA Programming and Performance	2	2106	August 11, 2019
"--default-stream per-thread" on multi-GPU environment not working as expected? CUDA Programming and Performance	1	299	September 19, 2023
Why my kernel code looses synchronization when running it in stream different from default ? CUDA Programming and Performance	9	1010	November 14, 2016
How to set --default-stream per-thread in nsight Nsight Eclipse Edition cuda	4	1015	July 10, 2020

Per-thread Default Stream Concurrency

Related topics