Is there some mechanism for communication between threads during the same kernel invocation, if the threads are running on different streaming multiprocessors?
AFAIK there are no documented or recommended ways to do this.
However, there were some attempts to do inter-block communications on this forum, try searching.
What BierdnA (everything is reversible) said is true. There's no documented way. The only suggested approach is to launch your kernel phase by phase (one kernel launch per phase), using global memory to hold the intermediate data between phases.
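To make the phase-by-phase idea concrete, here is a minimal sketch: a sum reduction where each kernel launch is one phase, the kernel launch boundary acts as the grid-wide synchronisation point, and per-block partial sums sit in global memory between launches. The kernel and variable names (`partialSums`, `d_a`, `d_b`) are illustrative, not from any NVIDIA sample.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One "phase": each block reduces its chunk of `in` to a single
// partial sum written to global memory at out[blockIdx.x].
__global__ void partialSums(const float *in, float *out, int n) {
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;      // load, pad the tail with zeros
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0]; // one partial sum per block
}

int main() {
    const int n = 1 << 20, threads = 256;
    const int blocks = (n + threads - 1) / threads;
    std::vector<float> h(n, 1.0f);
    float *d_a, *d_b;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, blocks * sizeof(float));
    cudaMemcpy(d_a, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Each iteration of this host loop is one phase; re-launching the
    // kernel is what synchronises all blocks, not anything on the GPU.
    int m = n;
    float *src = d_a, *dst = d_b;
    while (m > 1) {
        int b = (m + threads - 1) / threads;
        partialSums<<<b, threads, threads * sizeof(float)>>>(src, dst, m);
        float *t = src; src = dst; dst = t;  // ping-pong the buffers
        m = b;
    }
    float h_sum = 0.0f;
    cudaMemcpy(&h_sum, src, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", h_sum);  // summing 1<<20 ones -> 1048576.0
    cudaFree(d_a); cudaFree(d_b);
    return 0;
}
```

The cost is one kernel-launch overhead per phase, but the result is well-defined on every device, which is why this is the usual recommendation.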
This forum has a thread on "Block Synchronisation" that discusses an algorithm for synchronising blocks using atomic operations. But atomic operations are only available on devices of compute capability 1.1 and higher, and I am not sure that algorithm is NVIDIA-recommended; the programming guide never mentions anything like it. So it's better not to do block synchronisation.
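For reference, the atomic-operations scheme discussed in those threads is usually a global arrival counter that every block increments, with block 0's threads spinning until the count reaches the grid size. Below is a heavily hedged sketch of that pattern (names like `gridBarrier` are mine). It is NOT supported or recommended by NVIDIA: it requires compute capability 1.1 for the atomics, it deadlocks unless every block of the grid is resident on an SM simultaneously, and this version is single-use because the counter is never reset.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ unsigned int arrived = 0;   // global arrival counter

// Unsupported grid-wide barrier built from atomics. Deadlocks if any
// block is not co-resident on an SM; call at most once per launch.
__device__ void gridBarrier(unsigned int numBlocks) {
    __syncthreads();                    // whole block reached the barrier
    if (threadIdx.x == 0) {
        __threadfence();                // publish this block's global writes
        atomicAdd(&arrived, 1);         // announce arrival
        while (atomicAdd(&arrived, 0) < numBlocks)
            ;                           // spin until every block has arrived
    }
    __syncthreads();
}

// Phase 1 writes, barrier, phase 2 reads another block's phase-1 data.
__global__ void twoPhase(int *scratch, int *out, int numBlocks) {
    int total = gridDim.x * blockDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    scratch[i] = i;                                   // phase 1
    gridBarrier(numBlocks);
    out[i] = scratch[(i + blockDim.x) % total];       // phase 2
}

int main() {
    const int blocks = 4, threads = 64, n = blocks * threads;
    int *d_scratch, *d_out, h_out[n];
    cudaMalloc(&d_scratch, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    twoPhase<<<blocks, threads>>>(d_scratch, d_out, blocks);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("out[0] = %d\n", h_out[0]);  // expects thread 64's phase-1 value
    cudaFree(d_scratch); cudaFree(d_out);
    return 0;
}
```

Even when it happens to work, correctness depends on undocumented residency behaviour, which is exactly why kernel-launch boundaries are the safer synchronisation point.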
I would rather do that work on the CPU than attempt block synchronisation through global memory on the GPU. I think you should consider doing some of the work on the CPU and presenting the GPU with a data set that does NOT require block synchronisation.