CUDA stream problem in Windows

Hello all!
I have the following issue on Windows with CUDA 11.5, which doesn’t appear on Linux.
The problem is with CUDA streams on an RTX 5000 card:
when I launch a kernel or an asynchronous copy (cudaMemcpyAsync) on a created stream, the kernel/copy doesn’t actually start running until I reach some synchronization or query operation. For example, assume that my kernel runs for 10 ms:

my_kernel<<<…, my_stream>>>  // issue some kernel on my_stream
cudaEventRecord(my_event, my_stream);
Sleep(10);  // sleep for 10 ms, which is the kernel running time
int x = 0;
while (cudaEventQuery(my_event) != cudaSuccess)
{
    x++;  // count the polls needed until my_event is signaled
}
After running these commands I expected my_event to be already signaled and x to be 0,
but that is not the case: x ends up large.
The behavior is fixed only if I add a call to cudaEventQuery(my_event) before Sleep(10).
Only then does the kernel start running asynchronously, and x is 0 at the end as expected. This problem doesn’t appear with the same setup on Linux.
Please help me understand it.

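This is WDDM command batching. Under the WDDM driver model on Windows, the CUDA driver may hold kernel launches and asynchronous copies in a software queue and submit them to the GPU later as a batch. A call such as cudaEventQuery or cudaStreamQuery typically causes the pending batch to be submitted, which is why your extra query makes the kernel start. Linux does not use WDDM, so the work is submitted right away there. It is described in a number of articles on various forums. When I did a Google search on “wddm command batching” this was the first hit. You’ll find other instructive posts as well. Here is another example.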
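
A commonly suggested workaround is to issue a non-blocking cudaStreamQuery on the stream right after queuing the work, so the pending WDDM batch gets submitted before the host goes to sleep. The following is a minimal sketch of that idea, not a definitive fix: busy_kernel and its cycle count are hypothetical stand-ins for my_kernel, and whether x ends up at 0 depends on the driver version, the GPU, and how long the kernel really runs.

```cuda
#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Hypothetical stand-in for my_kernel: spins for roughly `cycles` GPU clock ticks.
__global__ void busy_kernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    cudaStream_t my_stream;
    cudaEvent_t my_event;
    cudaStreamCreate(&my_stream);
    cudaEventCreate(&my_event);

    // Queue the kernel and the event on the stream (cycle count is arbitrary).
    busy_kernel<<<1, 1, 0, my_stream>>>(1000000LL);
    cudaEventRecord(my_event, my_stream);

    // Non-blocking query. Its return value is ignored here; it is called only
    // for its side effect of nudging the WDDM driver to submit queued commands.
    cudaStreamQuery(my_stream);

    // Host sleeps while the GPU is (hopefully) already working.
    std::this_thread::sleep_for(std::chrono::milliseconds(10));

    // Count how many polls are still needed after the sleep.
    // Looping only on cudaErrorNotReady avoids spinning forever on a real error.
    int x = 0;
    while (cudaEventQuery(my_event) == cudaErrorNotReady)
    {
        ++x;
    }
    printf("x = %d\n", x);

    cudaEventDestroy(my_event);
    cudaStreamDestroy(my_stream);
    return 0;
}
```

If the GPU supports the TCC driver model and is not driving a display, switching it to TCC is another option that is often mentioned, since batching is a property of WDDM.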