The threadfence functions (__threadfence_block(), __threadfence(), and __threadfence_system()) are memory fences, not synchronization primitives of any kind. All they do is force a thread's writes to be flushed far enough up the memory hierarchy to guarantee visibility at the requested scope (block, device, or host, respectively). They do not lock memory locations, and they do not make any thread wait; race conditions are still possible.
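The canonical legitimate use is the single-pass reduction pattern: each block publishes a partial result, fences, and then bumps a counter, and whichever block discovers it was last knows everyone else's results are already visible. A minimal sketch of that pattern (names like sum_kernel and blocksDone are illustrative, and a real kernel would do a shared-memory reduction within each block first):

```cuda
__device__ unsigned int blocksDone = 0;   // how many blocks have published a partial sum

__global__ void sum_kernel(const float *in, volatile float *partial, float *out, int n)
{
    // Phase 1: thread 0 of each block sums a strided slice of the input.
    // (A real kernel would reduce across the whole block in shared memory first.)
    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int i = blockIdx.x; i < n; i += gridDim.x)
            s += in[i];
        partial[blockIdx.x] = s;

        // Fence: make this block's write to `partial` visible device-wide
        // *before* any other block can observe the counter increment below.
        __threadfence();

        // The block that increments the counter last knows every partial sum
        // is already visible, so it alone does phase 2. Nothing here blocks
        // or locks anything: the fence only orders the writes.
        unsigned int ticket = atomicInc(&blocksDone, gridDim.x);
        if (ticket == gridDim.x - 1) {
            float total = 0.0f;
            for (unsigned int b = 0; b < gridDim.x; ++b)
                total += partial[b];      // `partial` is volatile so these reads hit memory
            *out = total;
            blocksDone = 0;               // reset so the kernel can be launched again
        }
    }
}
```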
Global barriers can be hacked together using atomic functions to implement a semaphore in device memory, but they tend to be dangerous: you have no guarantee that all of your blocks are resident on the device at the same time, and blocks spinning at the barrier will not be preempted to let the not-yet-scheduled blocks make progress. (I'm reasonably sure that hardware before compute capability 3.5 can't preempt blocks at all.) The result is a deadlock, as in the sketch below.
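For concreteness, an improvised barrier of this kind usually looks something like the following (unsafe_global_barrier and arrived are illustrative names; this is shown only to make the hazard visible, not to be copied):

```cuda
__device__ unsigned int arrived = 0;      // device-global arrival counter

__device__ void unsafe_global_barrier(unsigned int num_blocks)
{
    __syncthreads();                      // everyone in this block has finished its work
    if (threadIdx.x == 0) {
        atomicAdd(&arrived, 1u);          // announce this block's arrival
        // Spin until every block has arrived. If any block has not been
        // scheduled onto an SM yet, it can never arrive, the resident blocks
        // never leave this loop, and the whole kernel deadlocks.
        while (atomicAdd(&arrived, 0u) < num_blocks) { }
    }
    __syncthreads();                      // release the rest of the block
}
```

It is also one-shot: reusing it within a kernel would require resetting the counter safely, which is a separate problem on top of the scheduling one.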
You can deliberately limit the grid to no more blocks than there are multiprocessors on the device, which makes it very likely that every block is resident at once (but, again, the CUDA runtime does not guarantee this). Even then, I would never use an improvised global barrier in production code.
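Sizing the launch that way is just a matter of querying the device (worker_kernel here is a placeholder, and occupancy limits such as registers, shared memory, or block size can still keep blocks off the SMs, so this is best-effort, not a guarantee):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void worker_kernel(void)
{
    // ... per-block work ...
}

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int num_blocks = prop.multiProcessorCount;   // one block per multiprocessor
    worker_kernel<<<num_blocks, 256>>>();
    cudaDeviceSynchronize();

    printf("launched %d blocks on a device with %d SMs\n",
           num_blocks, prop.multiProcessorCount);
    return 0;
}
```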
Also, launching kernels from inside kernels (dynamic parallelism) is a compute capability 3.5 feature, not a compute capability 3.0 feature.