Per-thread Default Stream Concurrency

Hi,

I am trying to implement per-thread default stream concurrency in my application. For starters, I tried just compiling and running Mark Harris’ multi-threading example. I tried both the #define option and the compile --default-stream -per-thread option. I tried this with both vs2017 v15.4 and with nvcc from the command line. I tried both with a Windows version of pthreads and C++11 std::threads.

However, I do not see concurrency using nvvp to profile the program. “cudaStreamSynchronize(0)” seems to synchronize all default streams instead of just the specific host thread’s per-thread default stream.

I tried on a Windows 7 machine with a Quadro P2000 card as well as a Windows 10 machine with an M2200 card. Has anyone else encountered these issues and have found fixes? What is the proper way to implement per-thread default streams on Windows?

Mark Harris’ tutorial: https://devblogs.nvidia.com/gpu-pro-tip-cuda-7-streams-simplify-concurrency/

WDDM command batching might be getting in your way on Windows. I generally wouldn’t try to architect complex concurrency scenarios in a WDDM regime. It can be done, but it’s generally more difficult and often requires unusual tricks. The TCC driver model was created to allow developers to avoid the WDDM issues on windows. You might try re-running your experiements in TCC mode on one of your GPUs. It may necessitate putting an additional GPU in the system, since TCC mode GPUs cannot host a display.

Or you could try it on linux.

txbob,

Thanks for your suggestion. I am having some IT issues with multiple GPUs installed, so I have not yet tried switching to TCC mode. When I am able to test TCC mode, I will post an update.

While IT and I work those hardware issues out, I wanted to try classic concurrency by explicitly using non-default streams. Even then, I get the behavior outlined in the previous post. To me, this does not seem like a per-thread default stream issue, but a core problem with stream management.

Does this seem to be the case, in your experience?

Thank you.

My relevant snippet of code:

const int N = 1 << 20;

__global__ void kernel(float *x, int n)
{
	int tid = threadIdx.x + blockIdx.x * blockDim.x;
	for (int i = tid; i < n; i += blockDim.x * gridDim.x) 
        {
		x[i] = sqrt(pow(3.14159, i));
	}
}

void launch_kernel(void *dev, void *host)
{
	cudaStream_t stream; 
	cudaStreamCreate(&stream);

	kernel << <1, 64, 0, stream>> >((float *)dev, N);

	cudaStreamSynchronize(stream);

	return;
}

int main()
{
	const int num_threads = 8;

	std::thread threads[num_threads];

	float *host[num_threads];
	float *dev[num_threads];

	for (int i = 0; i < num_threads; ++i)
	{
		cudaMalloc((void **)&dev[i], N * sizeof(float));

		cudaMallocHost((void **)&host[i], N * sizeof(float));
		
		memset(host[i], 0, N * sizeof(float));
	}

	cudaStream_t streams[num_threads];
	for (int i = 0; i < num_threads; ++i)
	{
		threads[i] = std::thread(&launch_kernel, dev[i], host[i]);
	}


	for (int i = 0; i < num_threads; ++i)
	{
		if (threads[i].joinable())
			threads[i].join();
	}

	cudaDeviceReset();

	return 0;
}