Behavior of cudaStreamQuery

My understanding is that cudaStreamQuery returns immediately with the status of the queried stream: cudaSuccess if all work in the stream has completed, cudaErrorNotReady if it has not. Based on that, I wrote the following code. The idea is to run two kernels concurrently, each in its own stream. Whichever kernel finishes first (the one with the shorter execution time) is then relaunched repeatedly until the first invocation of the other kernel finishes. I expect the two kernels to run concurrently, since both use only a very small number of thread blocks.

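// Launch both kernels, each in its own stream.
kernel1<<<grid1, block1, 0, stream[0]>>>();
kernel2<<<grid2, block2, 0, stream[1]>>>();

// Poll both streams until one of them reports completion.
while (1) {
  if (cudaStreamQuery(stream[0]) == cudaSuccess) {
    finished[0] = 1;
    break;
  }

  if (cudaStreamQuery(stream[1]) == cudaSuccess) {
    finished[1] = 1;
    break;
  }
}

if (finished[0] == 1) {
  // kernel1 finished first: relaunch it until kernel2 completes.
  while (cudaStreamQuery(stream[1]) != cudaSuccess) {
    iters[0]++;
    kernel1<<<grid1, block1, 0, stream[0]>>>();

    // Wait for this relaunch of kernel1 to finish.
    while (cudaStreamQuery(stream[0]) != cudaSuccess);
  }
}
else if (finished[1] == 1) {
  // kernel2 finished first: relaunch it until kernel1 completes.
  while (cudaStreamQuery(stream[0]) != cudaSuccess) {
    iters[1]++;
    kernel2<<<grid2, block2, 0, stream[1]>>>();

    // Wait for this relaunch of kernel2 to finish.
    while (cudaStreamQuery(stream[1]) != cudaSuccess);
  }
}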

I ran the code on a C2050 card with CUDA 5.0, and the behavior is confusing. I made the workload of kernel1 much larger than that of kernel2, so the query on kernel2's stream (stream[1]) should return cudaSuccess much earlier in the first while loop. Instead, the loop always breaks out at the first if statement, i.e. stream[0] is always reported as finished first. Has anybody observed similar behavior before?
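
For anyone who wants to try this, here is a minimal self-contained sketch of the first polling loop. It is not my actual code: the spin kernel, the cycle counts, and the streamDone helper are hypothetical stand-ins, and the launch configuration (one block of 32 threads per kernel) is just an assumption chosen so that both kernels can fit on the device at once. Unlike the snippet above, it separates cudaErrorNotReady from real errors and checks the launches with cudaGetLastError.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for kernel1/kernel2: busy-wait for about `cycles` clock ticks.
__global__ void spin(long long cycles) {
  long long start = clock64();
  while (clock64() - start < cycles) { }
}

// Returns true if all work in the stream has completed, false if it is
// still pending. Any other error code is reported and aborts the program.
static bool streamDone(cudaStream_t s, const char *name) {
  cudaError_t err = cudaStreamQuery(s);
  if (err != cudaSuccess && err != cudaErrorNotReady) {
    fprintf(stderr, "%s: %s\n", name, cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }
  return err == cudaSuccess;
}

int main() {
  cudaStream_t stream[2];
  for (int i = 0; i < 2; ++i) cudaStreamCreate(&stream[i]);

  // Assumed workloads: the first launch spins about 100x longer than the second.
  spin<<<1, 32, 0, stream[0]>>>(400000000LL);  // plays the role of kernel1
  spin<<<1, 32, 0, stream[1]>>>(4000000LL);    // plays the role of kernel2

  // Catch launch failures before polling.
  cudaError_t launchErr = cudaGetLastError();
  if (launchErr != cudaSuccess) {
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(launchErr));
    return EXIT_FAILURE;
  }

  // Same polling order as in the question: stream[0] is checked first.
  int first = -1;
  while (first < 0) {
    if (streamDone(stream[0], "stream[0]")) first = 0;
    else if (streamDone(stream[1], "stream[1]")) first = 1;
  }
  printf("stream[%d] was reported as finished first\n", first);

  cudaDeviceSynchronize();
  for (int i = 0; i < 2; ++i) cudaStreamDestroy(stream[i]);
  return 0;
}
```

If the two kernels really overlap, I would expect this to report stream[1]; if a stream query fails with something other than cudaErrorNotReady, the helper prints the error instead of silently continuing to poll.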