__syncthreads() only synchronizes the threads in the same block, so is there any function that can synchronize all the threads in a grid?
There is no way to do this that is provided directly by CUDA. You can write it yourself, but you need to be sure you have a good understanding of how the memory system works and how the thread launching system works.
In particular, you can write a barrier across multiple SMs with atomic operations if you are careful to make sure your memory instructions are ordered correctly and sent to the right level of the cache hierarchy.
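To make that concrete, here is a minimal sketch of a single-use barrier built from an atomic counter. The names (`grid_barrier_once`, `arrived`, `two_phase`, `run_barrier_demo`) are made up for illustration. It assumes every block of the grid is resident on the GPU at the same time, which is why the demo launches only 4 small blocks. If that assumption fails, the spin loop never exits.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical single-use grid barrier. ASSUMPTION: all blocks are co-resident;
// a block that has not been scheduled yet can never arrive, so this would hang.
__device__ void grid_barrier_once(unsigned int *arrived, unsigned int num_blocks)
{
    __syncthreads();                        // whole block has finished the first phase
    if (threadIdx.x == 0) {
        __threadfence();                    // order earlier global writes before the arrival
        atomicAdd(arrived, 1u);             // announce that this block has arrived
        while (atomicAdd(arrived, 0u) < num_blocks) { }  // atomic read is served from L2
        __threadfence();                    // order the following reads after the wait
    }
    __syncthreads();                        // release the rest of the block
}

__global__ void two_phase(int *tmp, int *out, unsigned int *arrived)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int n = gridDim.x * blockDim.x;
    tmp[i] = 2 * i;                         // phase 1: every thread writes one element
    __threadfence();                        // make this thread's write visible device-wide
    grid_barrier_once(arrived, gridDim.x);
    const volatile int *vtmp = tmp;         // volatile read so a cached copy is not reused
    out[i] = vtmp[n - 1 - i] + 1;           // phase 2: read an element from another block
}

bool run_barrier_demo()
{
    const int blocks = 4, threads = 64, n = blocks * threads;  // deliberately tiny grid
    int *d_tmp = nullptr, *d_out = nullptr;
    unsigned int *d_arrived = nullptr;
    bool ok = cudaMalloc(&d_tmp, n * sizeof(int)) == cudaSuccess
           && cudaMalloc(&d_out, n * sizeof(int)) == cudaSuccess
           && cudaMalloc(&d_arrived, sizeof(unsigned int)) == cudaSuccess
           && cudaMemset(d_arrived, 0, sizeof(unsigned int)) == cudaSuccess;
    std::vector<int> h_out(n, 0);
    if (ok) {
        two_phase<<<blocks, threads>>>(d_tmp, d_out, d_arrived);
        ok = cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    for (int i = 0; ok && i < n; ++i)
        ok = (h_out[i] == 2 * (n - 1 - i) + 1);
    cudaFree(d_tmp); cudaFree(d_out); cudaFree(d_arrived);
    return ok;
}

int main()
{
    printf("%s\n", run_barrier_demo() ? "barrier demo ok" : "barrier demo failed");
    return 0;
}
```

The counter is never reset, so this barrier can be used only once per launch. A reusable version needs extra state, such as a generation or sense flag.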
You also have to keep in mind that you may end up getting fewer blocks running concurrently than the occupancy calculation would suggest (for a variety of complex reasons). You need to write code that is correct even in this case, or your code will deadlock with a small probability.
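One way to stay on the safe side is sketched below: ask the runtime for the occupancy figure, treat it only as an upper bound, and write the kernel with a grid-stride loop so that it gives the right answer for any grid size. The kernel `scale`, the helpers `max_coresident_blocks` and `run_scale_demo`, and the halving of the bound are all illustrative assumptions, not a rule that guarantees co-residency.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride loop: correct for any number of blocks, including fewer than planned.
__global__ void scale(float *data, int n, float factor)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        data[i] *= factor;
}

// Upper bound on blocks of `scale` that could be resident at once, or -1 on error.
// The number that actually runs concurrently may be lower than this.
int max_coresident_blocks(int block_size)
{
    int device = 0, sm_count = 0, blocks_per_sm = 0;
    if (cudaGetDevice(&device) != cudaSuccess) return -1;
    if (cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount,
                               device) != cudaSuccess) return -1;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, scale,
                                                      block_size, 0) != cudaSuccess)
        return -1;
    return sm_count * blocks_per_sm;
}

bool run_scale_demo()
{
    const int n = 1 << 20, block_size = 256;
    int bound = max_coresident_blocks(block_size);
    if (bound < 1) return false;
    int grid = bound / 2 > 0 ? bound / 2 : 1;   // arbitrary safety margin (assumption)

    std::vector<float> h(n, 1.0f);
    float *d = nullptr;
    bool ok = cudaMalloc(&d, n * sizeof(float)) == cudaSuccess
           && cudaMemcpy(d, h.data(), n * sizeof(float),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        scale<<<grid, block_size>>>(d, n, 3.0f);
        ok = cudaMemcpy(h.data(), d, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    for (int i = 0; ok && i < n; ++i)
        ok = (h[i] == 3.0f);
    cudaFree(d);
    return ok;
}

int main()
{
    printf("%s\n", run_scale_demo() ? "scale demo ok" : "scale demo failed");
    return 0;
}
```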
It’s also worth noting that it is much less hassle to just launch multiple kernels back-to-back to do this (even though you lose all of your on-chip state between kernels).
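A minimal sketch of the two-kernel approach, with made-up kernel names (`phase1`, `phase2`) and a made-up computation. Kernels launched into the same stream run in order, so the kernel boundary acts as the grid-wide barrier: everything `phase1` wrote to global memory is visible to every block of `phase2`.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Phase 1: every thread writes one element of the intermediate buffer.
__global__ void phase1(const int *in, int *tmp, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = in[i] * 2;
}

// Phase 2: every thread reads an element that was usually written by another block.
__global__ void phase2(const int *tmp, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp[n - 1 - i] + 1;
}

bool run_two_phase(int n)
{
    std::vector<int> h_in(n), h_out(n, 0);
    for (int i = 0; i < n; ++i) h_in[i] = i;

    size_t bytes = n * sizeof(int);
    int *d_in = nullptr, *d_tmp = nullptr, *d_out = nullptr;
    bool ok = cudaMalloc(&d_in, bytes) == cudaSuccess
           && cudaMalloc(&d_tmp, bytes) == cudaSuccess
           && cudaMalloc(&d_out, bytes) == cudaSuccess
           && cudaMemcpy(d_in, h_in.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        phase1<<<blocks, threads>>>(d_in, d_tmp, n);   // default stream
        phase2<<<blocks, threads>>>(d_tmp, d_out, n);  // starts after phase1 has finished
        ok = cudaMemcpy(h_out.data(), d_out, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    for (int i = 0; ok && i < n; ++i)
        ok = (h_out[i] == 2 * (n - 1 - i) + 1);
    cudaFree(d_in); cudaFree(d_tmp); cudaFree(d_out);
    return ok;
}

int main()
{
    printf("%s\n", run_two_phase(1000) ? "two-kernel demo ok" : "two-kernel demo failed");
    return 0;
}
```

The cost is what the post above mentions: registers and shared memory do not survive the kernel boundary, so anything `phase2` needs has to go through global memory (here, `tmp`).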
OK, I think I should give up on the idea. I've decided to split my kernel into two smaller kernels.