Many in the forums have tried and posted their results.
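No, it works horribly. The simplest problem you will hit: not all blocks are guaranteed to run concurrently. If the grid has more blocks than the GPU can keep resident at once, the resident blocks spin at the barrier waiting for blocks that cannot be scheduled until a resident block exits. Therefore, your global barrier will deadlock.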
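To get a feel for how few blocks can actually be resident at once, here is a rough sketch using the runtime's occupancy query. The kernel `worker` and the helper `max_resident_blocks` are hypothetical names made up for illustration, and the sketch assumes device 0 and no dynamic shared memory.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; any kernel you plan to put a barrier in.
__global__ void worker(int *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = i;
}

// Hypothetical helper: upper bound on how many blocks of `worker` can be
// resident on device 0 at the same time. Returns -1 on any CUDA error.
int max_resident_blocks(int threadsPerBlock) {
    int blocksPerSM = 0;
    // Assumption: the kernel uses no dynamic shared memory (last argument 0).
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, worker, threadsPerBlock, 0) != cudaSuccess) {
        return -1;
    }
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        return -1;
    }
    return blocksPerSM * prop.multiProcessorCount;
}
```

If you launch more blocks than `max_resident_blocks(threadsPerBlock)` reports, some of them cannot start until others finish, which is exactly the situation where a spin-wait barrier hangs.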
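Yes. Just use multiple kernel invocations: within a stream, the next kernel does not start until every block of the previous one has finished, so the launch boundary is your global barrier. Is 10 microseconds that bad an overhead to pay?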
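A minimal sketch of the multiple-kernel approach, assuming both launches go into the default stream. The names `phase1`, `phase2` and `run_two_phase` are made up for illustration, and `n` is assumed to be positive.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Phase 1: each thread writes one element.
__global__ void phase1(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

// Phase 2: each thread reads an element that was (in general) written by a
// different block in phase 1. Safe only because phase 1 has fully completed.
__global__ void phase2(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[n - 1 - i] + 1;
}

// Hypothetical host helper: runs both phases and checks the result.
// Returns true if every output element has the expected value n - i.
bool run_two_phase(int n) {
    if (n <= 0) return false;
    int *d_a = nullptr, *d_b = nullptr;
    if (cudaMalloc(&d_a, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_b, n * sizeof(int)) != cudaSuccess) {
        cudaFree(d_a);
        return false;
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Two launches in the same (default) stream execute in order, so the
    // boundary between them acts as the grid-wide barrier.
    phase1<<<blocks, threads>>>(d_a, n);
    phase2<<<blocks, threads>>>(d_a, d_b, n);

    // The copy also runs after both kernels have finished.
    std::vector<int> h(n);
    cudaError_t err = cudaMemcpy(h.data(), d_b, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        if (h[i] != n - i) return false;
    }
    return true;
}
```

Calling `run_two_phase(1 << 20)` from `main` launches far more blocks than can be resident at once and still cannot deadlock, because no block ever waits on another block inside a kernel.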