Is there any documentation that covers __barrier_sync() and __barrier_sync_count()?

Yes, you can use them in CUDA. See Compiler Explorer.

They use the PTX barrier instruction. See the PTX ISA documentation (parallel-thread-execution 8.1).

It's similar to __syncthreads().
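To make the comparison concrete, here is a minimal sketch of a kernel that uses __barrier_sync(0) where you would normally write __syncthreads(). The kernel name, the helper function, and the block size are made up for illustration. I'm also assuming that barrier 0 with all threads of the block participating behaves like __syncthreads(), and that your toolkit exposes the intrinsic for your target architecture. Since these intrinsics don't seem to be documented, check the generated PTX before relying on this.

```cuda
#include <cassert>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // hypothetical block size, a multiple of the warp size

// Reverses kThreads ints in place using shared memory.
__global__ void reverse_block(int *data) {
    assert(blockDim.x == kThreads);
    __shared__ int tile[kThreads];
    unsigned t = threadIdx.x;
    tile[t] = data[t];
    // Assumption: barrier 0 with every thread in the block participating,
    // i.e. used here in the same place __syncthreads() would go.
    __barrier_sync(0);
    data[t] = tile[kThreads - 1 - t];
}

// Hypothetical host helper: runs the kernel once and checks the result.
bool reverse_with_barrier_sync() {
    int host[kThreads];
    for (int i = 0; i < kThreads; ++i) host[i] = i;

    int *dev = nullptr;
    if (cudaMalloc(&dev, sizeof(host)) != cudaSuccess) return false;
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);

    reverse_block<<<1, kThreads>>>(dev);

    // This copy waits for the kernel to finish and reports any earlier error.
    cudaError_t err = cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < kThreads; ++i) {
        if (host[i] != kThreads - 1 - i) return false;
    }
    return true;
}
```

Call reverse_with_barrier_sync() from main() to try it. As far as I can tell, __barrier_sync_count(id, cnt) is the same thing with an explicit thread count, matching the two-operand form of the PTX barrier instruction, where the count is expected to be a multiple of the warp size. I haven't seen that stated in official docs for the intrinsic itself, so treat it as an assumption.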