cudaStreamQuery() works strangely

I am having trouble with cudaStreamQuery(). Suppose kernel0 finishes at about 0.0013 s
and kernel1 finishes at about 0.013 s.
I simplified the code to make the problem easier to understand. For example, my original code does not contain a while loop without a break.
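
For reference, here is a minimal sketch of one way finish times like the ones above could be measured with CUDA events. The spin kernel, its cycle counts, and the one-thread launch configuration are hypothetical stand-ins, not the real kernels. Each time is measured from the point where that stream reaches its own begin event, not from a shared origin.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Abort main() with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e = (call);                                       \
        if (e != cudaSuccess) {                                       \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(e));                           \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical stand-in for kernel0/kernel1: busy-waits for roughly
// `cycles` GPU clock ticks.
__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    cudaStream_t stream[2];
    cudaEvent_t begin[2], end[2];
    // Arbitrary durations: the second kernel runs about 10x longer.
    const long long cycles[2] = {1000000LL, 10000000LL};

    for (int i = 0; i < 2; ++i) {
        CHECK(cudaStreamCreate(&stream[i]));
        CHECK(cudaEventCreate(&begin[i]));
        CHECK(cudaEventCreate(&end[i]));
    }

    // Bracket each kernel with two events recorded in its own stream.
    for (int i = 0; i < 2; ++i) {
        CHECK(cudaEventRecord(begin[i], stream[i]));
        spin<<<1, 1, 0, stream[i]>>>(cycles[i]);
        CHECK(cudaGetLastError());
        CHECK(cudaEventRecord(end[i], stream[i]));
    }

    // Wait for each end event, then read the elapsed GPU time.
    for (int i = 0; i < 2; ++i) {
        float ms = 0.0f;
        CHECK(cudaEventSynchronize(end[i]));
        CHECK(cudaEventElapsedTime(&ms, begin[i], end[i]));
        printf("kernel %d took %.3f ms\n", i, ms);
    }

    for (int i = 0; i < 2; ++i) {
        CHECK(cudaEventDestroy(begin[i]));
        CHECK(cudaEventDestroy(end[i]));
        CHECK(cudaStreamDestroy(stream[i]));
    }
    return 0;
}
```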

kernel0<<<…stream[0]>>>(…,);
kernel1<<<…stream[1]>>>(…,);
while(1)
{
    kkk++;
    if (cudaStreamQuery(stream[0])==cudaSuccess) cout<<"kernel 0 finished at iteration: "<<kkk<<endl;
    if (cudaStreamQuery(stream[1])==cudaSuccess) cout<<"kernel 1 finished at iteration: "<<kkk<<endl;
}
The results are:
kernel 0 finished at iteration: 1611
kernel 0 finished at iteration: 1612
kernel 0 finished at iteration: 1613
.
.
.
.
kernel 0 finished at iteration: 12133
kernel 1 finished at iteration: 12133
kernel 0 finished at iteration: 12134
kernel 1 finished at iteration: 12134
.
.
.

However, if I swap the order of the two cudaStreamQuery() calls:
kernel0<<<…stream[0]>>>(…,);
kernel1<<<…stream[1]>>>(…,);
while(1)
{
    kkk++;
    if (cudaStreamQuery(stream[1])==cudaSuccess) cout<<"kernel 1 finished at iteration: "<<kkk<<endl;
    if (cudaStreamQuery(stream[0])==cudaSuccess) cout<<"kernel 0 finished at iteration: "<<kkk<<endl;
}
The results are:
kernel 1 finished at iteration: 12133
kernel 0 finished at iteration: 12133
kernel 1 finished at iteration: 12134
kernel 0 finished at iteration: 12134
.
.
.
.

However, in the second case kernel 0 had actually finished by iteration 1611, but that was never printed.

So my guess is this: when the two cudaStreamQuery() calls are made at iteration 1 and neither stream has finished, the query on the stream that is queried first waits for that stream to finish, and only then is the query on the second stream handled.
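
One way to test the guess above is to time every cudaStreamQuery() call on the host and record the first iteration at which each stream reports cudaSuccess. The following is only a sketch under assumptions: the spin kernel, its cycle counts, and the order array are hypothetical and stand in for the real kernels. If a query really waited for its stream to finish, the longest call should be roughly as long as the kernel itself, and the first success would show up at iteration 1.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for kernel0/kernel1: busy-waits for roughly
// `cycles` GPU clock ticks.
__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        if (cudaStreamCreate(&stream[i]) != cudaSuccess) return 1;
    }

    // Arbitrary durations: the kernel in stream[1] runs about 10x longer.
    spin<<<1, 1, 0, stream[0]>>>(1000000LL);
    spin<<<1, 1, 0, stream[1]>>>(10000000LL);
    if (cudaGetLastError() != cudaSuccess) return 1;

    const int order[2] = {1, 0};       // query order under test; swap to compare
    long firstDone[2] = {0, 0};        // first iteration that returned cudaSuccess
    double maxCallUs[2] = {0.0, 0.0};  // longest single query call, in microseconds

    long kkk = 0;
    while (firstDone[0] == 0 || firstDone[1] == 0) {
        ++kkk;
        for (int j = 0; j < 2; ++j) {
            int s = order[j];
            auto t0 = std::chrono::steady_clock::now();
            cudaError_t status = cudaStreamQuery(stream[s]);
            auto t1 = std::chrono::steady_clock::now();

            double us =
                std::chrono::duration<double, std::micro>(t1 - t0).count();
            if (us > maxCallUs[s]) maxCallUs[s] = us;

            if (status == cudaSuccess) {
                if (firstDone[s] == 0) firstDone[s] = kkk;
            } else if (status != cudaErrorNotReady) {
                fprintf(stderr, "query failed: %s\n", cudaGetErrorString(status));
                return 1;
            }
        }
    }

    for (int s = 0; s < 2; ++s) {
        printf("stream %d: first finished at iteration %ld, "
               "longest query call %.1f us\n", s, firstDone[s], maxCallUs[s]);
        cudaStreamDestroy(stream[s]);
    }
    return 0;
}
```

Running it once with order = {1, 0} and once with order = {0, 1} gives a direct comparison of the two cases, and unlike the loops above it exits as soon as both streams have reported completion.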

Does anyone have any idea? Thanks!