This may sound a little funny, but developing a parallel algorithm is a ‘state of mind’. I’ve seen so many students struggle with, for example, OOP after being trained in procedural methods. At some point, however, something clicks in the brain and they ‘think’ OOP. Some take an hour to reach this point, some take weeks. The same is true in my CUDA classes.
Read as much about parallel programming as you can, and once it all clicks, you’ll be able to solve your problem as well.
You’ll see, for example, that nobody really ‘syncs all threads over all blocks’. Instead, people devise methods built from several steps, each consisting of many independent calculations. See, for example, the prefix sum algorithm.
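To give a feel for what "several steps of many independent calculations" looks like, here is a rough sketch of a block-wise prefix sum. It is plain C++ running on the CPU, not real CUDA: each outer loop over blocks stands in for one kernel launch, and the end of a loop plays the role of the launch boundary where all blocks are known to be finished. The names `blockwise_prefix_sum` and `kBlockSize` are made up for this illustration, and a real GPU version would also parallelise the work inside each block.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical block size; on a GPU this would be the threads per block.
constexpr std::size_t kBlockSize = 4;

// Inclusive prefix sum in three phases. Within a phase, no block needs
// data from any other block, so the blocks could run in any order.
std::vector<int> blockwise_prefix_sum(const std::vector<int>& in) {
    const std::size_t n = in.size();
    const std::size_t num_blocks = (n + kBlockSize - 1) / kBlockSize;
    std::vector<int> out(n);
    std::vector<int> block_totals(num_blocks);

    // Phase 1 (would be kernel 1): each block scans only its own chunk
    // and records the sum of that chunk.
    for (std::size_t b = 0; b < num_blocks; ++b) {
        int running = 0;
        for (std::size_t i = b * kBlockSize;
             i < n && i < (b + 1) * kBlockSize; ++i) {
            running += in[i];
            out[i] = running;
        }
        block_totals[b] = running;
    }

    // Phase 2 (would be kernel 2, small): exclusive scan of the block
    // totals, giving the offset each block still has to add.
    std::vector<int> block_offsets(num_blocks);
    int carry = 0;
    for (std::size_t b = 0; b < num_blocks; ++b) {
        block_offsets[b] = carry;
        carry += block_totals[b];
    }

    // Phase 3 (would be kernel 3): each block adds its own offset to its
    // chunk, again without looking at any other block.
    for (std::size_t b = 0; b < num_blocks; ++b) {
        for (std::size_t i = b * kBlockSize;
             i < n && i < (b + 1) * kBlockSize; ++i) {
            out[i] += block_offsets[b];
        }
    }
    return out;
}
```

Calling `blockwise_prefix_sum` on the numbers 1 to 10 gives the running totals 1, 3, 6, 10 and so on. The point is that no step ever needs a barrier across all blocks while it is running; the only global synchronisation is the gap between one step and the next.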
Hopefully this makes some sense ;-)