Why is Sleep() blocking all CUDA streams?

demo code:

#include <windows.h>   // Sleep() lives in kernel32, which is linked by default
#include <cuda_runtime.h>

__global__ void kernel(char* in)
{
	// dummy work to keep the GPU busy for a while
	char* pIn = in + blockDim.x * blockIdx.x + threadIdx.x * 100;
	for (int i = 0; i < 50; i++)
		pIn[i] = (char)i;
}

int main(void)
{
	int buffer_size = 1024 * 1024 * 100;
	char* pCpuBuffer;
	char* pCudaBuffer[2];
	cudaStream_t s[2];
	for (int i = 0; i < 2; i++)
	{
		cudaStreamCreate(&s[i]);
		cudaMalloc(&pCudaBuffer[i], buffer_size);
	}
	cudaMallocHost(&pCpuBuffer, buffer_size);

	// issue async copies and kernels alternately into the two streams
	for (int i = 0; i < 6; i++)
	{
		cudaMemcpyAsync(pCudaBuffer[i % 2], pCpuBuffer, buffer_size, cudaMemcpyHostToDevice, s[i % 2]);
		kernel<<<1024, 1024, 0, s[i % 2]>>>(pCudaBuffer[i % 2]);
	}

	Sleep(100);            // stand-in for CPU-side data processing
	cudaDeviceSynchronize();
	return 0;
}

The CUDA documentation says that asynchronous launches return control to the CPU so it can process data in parallel with the GPU, so in this demo Sleep() stands in for that CPU-side data processing. I would expect it not to block the streams.

win10 x64
1050ti
11.6 sdk
vs2022

Any host code activity blocks things in this way, because it delays the point where your program reaches calls to CUDA library routines such as cudaDeviceSynchronize(). Coupled with that, Sleep will likely have an effect on WDDM batching. You can google for that ("cuda wddm batching"). There isn't much you can do about WDDM batching, except to switch your work to TCC mode. You might also experiment with the Windows setting for "GPU Hardware Scheduling", but I don't know if it will have any effect in this case. see here
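For reference, a minimal sketch (using the standard cudaGetDeviceProperties call) of how to check which driver model a device is currently running under; a GeForce card driving a display will report WDDM:

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
	int dev = 0;
	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, dev);
	// tccDriver is 1 when the device runs the TCC driver, 0 for WDDM
	printf("%s: %s driver\n", prop.name, prop.tccDriver ? "TCC" : "WDDM");
	return 0;
}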

I have checked my system: Hardware-accelerated GPU Scheduling is not supported, and my video card needs to output images to a display, so it cannot be switched to TCC mode.

So:
Is Sleep (or other host code) blocking CUDA streams a known problem?

And is cudaStreamSynchronize/cudaEventSynchronize blocking other streams a known problem? link here

And is there no solution?

WDDM batching is a known issue.

The thing that will cause forward progress in your code in the presence of WDDM batching is the issuance of a cudaDeviceSynchronize() call (or another blocking sync call, such as cudaMemcpy), i.e. actually allowing the host code to progress to the point of making the library call associated with cudaDeviceSynchronize(). If you don't do that, then yes, it's possible that other CUDA work may "never" actually get issued to the GPU.
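As a rough sketch of that idea (not a guaranteed fix, just showing where the blocking call has to sit), the tail of main() from the question could be reordered so the host thread actually reaches the runtime call instead of sitting in Sleep() with the work still batched:

	// issue loop from the question, unchanged
	for (int i = 0; i < 6; i++)
	{
		cudaMemcpyAsync(pCudaBuffer[i % 2], pCpuBuffer, buffer_size, cudaMemcpyHostToDevice, s[i % 2]);
		kernel<<<1024, 1024, 0, s[i % 2]>>>(pCudaBuffer[i % 2]);
	}

	// reaching a blocking call here lets the batched work actually issue to the GPU...
	cudaDeviceSynchronize();

	// ...and only then do the CPU-side work that Sleep() stands in for
	Sleep(100);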

This issue has been brought up various times in various forum postings, and people have struggled to get the work to actually issue to the GPU. There isn't any defined method to get the WDDM queue to empty, and even if there were, calling Sleep would interrupt that.

Ok, thank you very much for your help.

In addition, could you tell me why calling cudaStreamSynchronize/cudaEventSynchronize ends up waiting for all streams to complete? Is there a suggested solution to that problem?

I want to wait only for one stream to complete, not for all streams to complete.

link here
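What I have in mind is roughly this sketch (the event name is just for illustration); per the linked thread, in practice the wait appears to cover the other stream's work as well:

	cudaEvent_t done0;
	cudaEventCreate(&done0);

	// queue work into both streams as before, then mark the end of stream 0
	cudaEventRecord(done0, s[0]);

	// intent: wait only for the work queued in s[0]; s[1] should keep running
	cudaEventSynchronize(done0);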