This SO answer implements a critical section using the ballot function. However, __ballot() is deprecated in current CUDA. How would this code look nowadays?
If you meant “__ballot_sync(FULL_MASK, req);” above, then I believe this will do the job, taking into consideration both the SO code and the “Warp-Level Primitives” post.
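The difference between the two is that, in the __ballot(req) case, there is no synchronisation guarantee between all active threads at this point, whereas __ballot_sync() synchronises the threads named in its mask before evaluating the vote. FULL_MASK, which captures all threads in the warp, is a valid choice.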
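For reference, here is a minimal sketch of how a warp-level critical section built on __ballot_sync() might look. It is not the SO answer's code: the kernel name warp_critical_section, the helper run_ballot_demo, the "even lanes request" condition and the counter increment are all hypothetical, chosen only to make the example self-contained. It assumes FULL_MASK is defined as 0xffffffff and that all 32 lanes of the warp reach the call, which is what makes FULL_MASK valid. It also only serialises lanes within one warp, so it is launched with a single warp.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define FULL_MASK 0xffffffffu  // assumption: every lane of the warp participates

// Hypothetical kernel: lanes with req == true enter the critical section one at a time.
__global__ void warp_critical_section(int *counter, unsigned *ballot_out)
{
    const unsigned lane = threadIdx.x % warpSize;
    const bool req = (lane % 2 == 0);  // illustrative: even lanes request access

    // Replacement for the deprecated __ballot(req): the lanes in the mask are
    // synchronised, then bit i of the result is set if lane i passed a true predicate.
    unsigned pending = __ballot_sync(FULL_MASK, req);
    if (lane == 0) {
        *ballot_out = pending;
    }

    // Every lane holds the same 'pending' value, so the loop is warp-uniform.
    while (pending != 0) {
        const unsigned leader = __ffs(pending) - 1;  // lowest requesting lane
        if (lane == leader) {
            *counter += 1;  // critical section body (placeholder work)
        }
        __syncwarp(FULL_MASK);   // order the leader's write before the next turn
        pending &= pending - 1;  // clear the lowest set bit
    }
}

// Hypothetical host helper: runs one warp and returns the ballot mask.
unsigned run_ballot_demo(int *count_out)
{
    int *d_counter = nullptr;
    unsigned *d_ballot = nullptr;
    cudaMalloc(&d_counter, sizeof(int));
    cudaMalloc(&d_ballot, sizeof(unsigned));
    cudaMemset(d_counter, 0, sizeof(int));
    cudaMemset(d_ballot, 0, sizeof(unsigned));

    warp_critical_section<<<1, 32>>>(d_counter, d_ballot);
    cudaDeviceSynchronize();

    unsigned ballot = 0;
    cudaMemcpy(&ballot, d_ballot, sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaMemcpy(count_out, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    cudaFree(d_ballot);
    return ballot;
}

int main()
{
    int count = 0;
    const unsigned ballot = run_ballot_demo(&count);
    printf("ballot = 0x%08x, counter = %d\n", ballot, count);
    return 0;
}
```

With the even-lane condition used here, the ballot comes back as 0x55555555 and the counter ends at 16, one increment per requesting lane. If only some lanes can reach the vote (for example inside a divergent branch), FULL_MASK is no longer safe and the mask has to name exactly the lanes that execute the call.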