Kernels won't run in parallel

I have the following sequence of kernel calls in my host code:

kernel_A<<<gridDim, blockDim, sm_size>>>(params);
while (cond) {
    cudaMemcpy(d_x, h_x, x_size, cudaMemcpyHostToDevice);
    cudaMemset(d_y, 0, y_size);
    // cudaDeviceSynchronize();
    kernel_B<<<gridDim, blockDim, sm_size, streams[0]>>>(params);
    cudaEventRecord(event, streams[0]);
    // cudaDeviceSynchronize();
    kernel_C<<<gridDim, blockDim, sm_size, streams[1]>>>(params);
    cudaStreamWaitEvent(streams[1], event, 0);
    kernel_D<<<gridDim, blockDim, sm_size, streams[1]>>>(params);
    kernel_E<<<gridDim, blockDim, sm_size, streams[0]>>>(params);
}

The problem is that neither pair of kernels (B+C or D+E) launches in parallel, but only on the first pass through the loop. On all subsequent passes the parallel execution succeeds, according to the Visual Profiler (the same goes for the release-build printouts). There is no shortage of GPU resources. The GPU is a GTX 480 with compute capability 2.0. It doesn't help to uncomment the sync points or even to remove the event handling entirely. The only suspect is kernel A before the loop, which launches in the default stream, but even if I move it into one of the streams, the issue remains.

What could possibly be the problem? I believe all of the conditions for parallel kernel execution are satisfied and there should be no implicit synchronization or queue blocking.

I don’t see anything that would prevent overlap either, though I would try cudaMemcpyAsync() and cudaMemsetAsync() if possible, since cudaMemcpy() and cudaMemset() will cause all subsequent kernels to wait on them.
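Something like this, as a rough sketch rather than your actual code (note that cudaMemcpyAsync() only truly overlaps when the host buffer is pinned, e.g. allocated with cudaMallocHost(); I'm reusing your names d_x, h_x, d_y and streams[0] here):

cudaMemcpyAsync(d_x, h_x, x_size, cudaMemcpyHostToDevice, streams[0]);
cudaMemsetAsync(d_y, 0, y_size, streams[0]);
// Both calls return immediately and are ordered only within streams[0],
// so work queued in streams[1] is free to execute concurrently.
kernel_B<<<gridDim, blockDim, sm_size, streams[0]>>>(params);

Keep in mind this changes the synchronization semantics: the blocking cudaMemcpy() guaranteed the copy had finished before anything later was issued, while the async version only orders it within streams[0].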

If it overlaps on all iterations except the 1st, isn’t that good enough?

Also, are you aware of the single issue queue limitation on Fermi that can cause even parallel streams to serialize? That doesn’t seem to be the problem here, since you appear to follow the best practice of issuing independent kernels in breadth-first order, before issuing the dependent ones.
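For reference, with hypothetical kernels first_stage/second_stage (not yours), breadth-first issue order looks like this:

// Breadth-first: launch all independent work across the streams first,
// then the dependent work, so the single Fermi issue queue never stalls
// behind a launch that has to wait.
for (int i = 0; i < nStreams; ++i)
    first_stage<<<grid, block, 0, streams[i]>>>(params);
for (int i = 0; i < nStreams; ++i)
    second_stage<<<grid, block, 0, streams[i]>>>(params);

// Depth-first (can serialize on Fermi): all of one stream's work, then the next.
for (int i = 0; i < nStreams; ++i) {
    first_stage<<<grid, block, 0, streams[i]>>>(params);
    second_stage<<<grid, block, 0, streams[i]>>>(params);
}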

Also, your cudaStreamWaitEvent(streams[1], event, 0) seems to be a no-op, because you never record such an event.

Yes, I am aware of the kernel and memcpy queue limitations. I do record the event in the first stream before launching kernel C.
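To spell out the pattern I am using, with generic kernel names instead of B/C/D:

producer<<<gridDim, blockDim, sm_size, streams[0]>>>(params);
cudaEventRecord(event, streams[0]);        // recorded right after the producer in streams[0]

independent<<<gridDim, blockDim, sm_size, streams[1]>>>(params); // may overlap with the producer
cudaStreamWaitEvent(streams[1], event, 0); // streams[1] waits for the recorded event
dependent<<<gridDim, blockDim, sm_size, streams[1]>>>(params);   // starts only once the producer is done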

Each subsequent pass through the loop operates on a simpler subproblem, which is why the first pass is the most important one. But even if I artificially reduce the initial problem size, there is still no overlap on the first pass. A resource limitation is definitely not the issue.

However, in another version of the host code, where the functionality of kernels A and B is folded into kernels C and D (with some redundancy, which is why I later extracted them), parallel kernel execution succeeds on the first loop pass. So there must be some problem with launching kernels A and B before kernels C and D that prevents their parallelization, even with sync points in between. There should be no dependency that prevents kernel D from launching immediately after kernel C.
Could this be some sort of bug? Is there some way to determine what is blocking the parallel kernel execution?

Could it be that kernel A happens to fill the GPU’s resources exactly, so that no other kernel can execute in parallel?

How many blocks are you launching and what does the Occupancy calculator report for the number of blocks/SM?
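If your toolkit is recent enough to have the occupancy API (CUDA 6.5 or later), you can also query this programmatically instead of using the spreadsheet; a sketch, reusing kernel_A, blockDim and sm_size from your snippet:

int numBlocksPerSM = 0;
int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;

// How many resident blocks of kernel_A fit on one SM for this block size
// and dynamic shared memory amount.
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSM, kernel_A,
                                              threadsPerBlock, sm_size);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
printf("kernel_A: %d blocks/SM x %d SMs = %d concurrent blocks\n",
       numBlocksPerSM, prop.multiProcessorCount,
       numBlocksPerSM * prop.multiProcessorCount);

If the grid you launch for kernel A has at least that many blocks, kernel A alone fills the device and nothing from the other streams can start until some of its blocks retire.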