CUDA vs ATI Stream comparison

One thing I have come to dislike about CUDA is that it provides no way to synchronize across blocks.
This is a big limitation, because the overhead of re-launching a kernel thousands of times in a loop is too high.
I am going to look into ATI Stream. CUDA’s stream processors are 100% independent and blind to the other
processors on the same chip. They can’t even communicate among themselves. The reasoning NVIDIA gives
for this limitation is not satisfactory to me.

It is exactly the same with ATI. Also, you can communicate and synchronize between blocks via atomics on global memory, but it is something one should avoid. What kind of problem do you have where you actually need this?
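For anyone wondering what communication between blocks via atomics can look like, here is a minimal sketch and not a recommendation. It assumes a GPU and toolkit with global-memory atomics and `__threadfence()`. Each block writes a partial sum, takes a "ticket" with `atomicAdd`, and the block that draws the last ticket adds up all the partials. The names `sum_kernel`, `blocks_done` and `device_sum_of_ones` are made up for this illustration. No block ever waits on another one, which is what keeps this pattern safe; a spin-wait barrier across blocks can hang if not all blocks are resident at once.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define THREADS 256  // block size; must be a power of two for the reduction below

// Hypothetical ticket counter in global memory, shared by all blocks.
__device__ unsigned int blocks_done = 0;

__global__ void sum_kernel(const float *in, volatile float *partial,
                           float *out, int n)
{
    __shared__ float cache[THREADS];
    __shared__ bool is_last;

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    cache[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction inside the block (block-level sync is available).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) cache[tid] += cache[tid + s];
        __syncthreads();
    }

    if (tid == 0) {
        partial[blockIdx.x] = cache[0];
        __threadfence();  // publish the partial before taking a ticket
        unsigned int ticket = atomicAdd(&blocks_done, 1u);
        is_last = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    // Only the block that took the last ticket knows every partial is written.
    if (is_last && tid == 0) {
        float total = 0.0f;
        for (unsigned int b = 0; b < gridDim.x; ++b) total += partial[b];
        *out = total;
        blocks_done = 0;  // reset so the kernel can be launched again
    }
}

// Host helper: sums n ones on the device and returns the result.
float device_sum_of_ones(int n)
{
    std::vector<float> host(n, 1.0f);
    int blocks = (n + THREADS - 1) / THREADS;
    float *d_in, *d_partial, *d_out, result = 0.0f;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    sum_kernel<<<blocks, THREADS>>>(d_in, d_partial, d_out, n);
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_partial); cudaFree(d_out);
    return result;
}

int main()
{
    int n = 1 << 16;
    printf("sum = %.1f (n = %d)\n", device_sum_of_ones(n), n);
    return 0;
}
```

Note that this only tells one block that all the others are done. It is not a barrier after which every block continues, which is the part that is hard to do safely.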

I can tell you that I have a linear solver in a real-time physics engine which requires 300 kernels for every time step; all of them together take about 8 ms on a normal data set. If you launch the kernels back to back without any other operation, the launch overhead is minimal (3 microseconds, if I'm not mistaken). Of course, a global sync would be great, since I would be able to eliminate some of the temporary data that I now have to store to global memory and read back in the next kernel.
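To make the back-to-back pattern concrete, here is a rough sketch. It is not the solver described above; the kernels `half_step` and `add_step` and the helper `run_steps` are invented for illustration. The kernel boundary acts as the global sync: `half_step` reads a neighbour that may belong to another block, so its result goes to a temporary buffer in global memory and the next kernel reads it back. Taking the figures quoted above at face value, 300 launches at roughly 3 microseconds each would come to something on the order of 1 ms of an 8 ms step, but the actual cost depends on the hardware and driver, so the sketch times the loop with CUDA events instead of assuming a number.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Step 1: reads a neighbour that may live in another block, writes to tmp.
__global__ void half_step(const float *x, float *tmp, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = 0.5f * x[(i + 1) % n];
}

// Step 2: reads the temporary back from global memory and updates x.
__global__ void add_step(const float *tmp, float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = tmp[i] + 1.0f;
}

// Launches `pairs` pairs of kernels back to back, with no host work between
// them. Returns x[0]; optionally reports the elapsed time in milliseconds.
float run_steps(int n, int pairs, float *elapsed_ms)
{
    float *d_x, *d_tmp, first = 0.0f, ms = 0.0f;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_tmp, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));  // x starts at 0 everywhere

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start);
    for (int k = 0; k < pairs; ++k) {
        // Kernels in the same stream run in order, so each launch sees the
        // complete output of the previous one.
        half_step<<<blocks, threads>>>(d_x, d_tmp, n);
        add_step<<<blocks, threads>>>(d_tmp, d_x, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);

    cudaMemcpy(&first, d_x, sizeof(float), cudaMemcpyDeviceToHost);
    if (elapsed_ms) *elapsed_ms = ms;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    cudaFree(d_tmp);
    return first;
}

int main()
{
    float ms = 0.0f;
    float x0 = run_steps(1 << 20, 150, &ms);  // 150 pairs = 300 launches
    printf("x[0] = %f, 300 launches took %.3f ms\n", x0, ms);
    return 0;
}
```

The iteration x = 0.5 * x + 1 converges to 2, so `x[0]` should end up very close to 2. Running it with a tiny `n` gives a rough idea of the per-launch cost on a given machine, since the kernels then do almost no work.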