__threadfence_block() vs __threadfence()?

There are two different memory fence functions, __threadfence_block() and __threadfence(). I am confused about how they differ when they fence global memory operations.
IMHO, for loads and stores to global memory, both __threadfence_block() and __threadfence() only guarantee ordering for the calling thread. They only affect the order in which the calling thread's memory operations appear to other threads (not just those in the same block), and they never affect the execution of other threads. Is that right? If so, the two functions should behave identically when all the memory operations target global memory.
If not, is there an example that shows the difference, i.e. a scenario where __threadfence_block() fails but __threadfence() works? Thanks.

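__threadfence() acts as __threadfence_block() for all threads in the block of the calling thread (CUDA C++ Programming Guide, page 134). In addition, it orders the calling thread's writes as observed by every thread on the device, whereas __threadfence_block() only guarantees that ordering as observed by threads in the same block. So the two differ in which observers the guarantee covers, even when all the operations target global memory.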
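
A scenario where the difference could matter is one block publishing data that a different block consumes. Below is a minimal sketch modelled on the "last block done" sum pattern. The names (sum_kernel, blocks_done, sum_on_device) and the launch sizes are my own placeholders, and error checking is left out for brevity. With __threadfence() the last block is guaranteed to see every partial sum. If you swapped in __threadfence_block(), the ordering would only be guaranteed for observers in the same block, so the last block would be allowed to see the counter update before a partial sum. Whether that actually shows up on a given GPU is not something I would rely on either way.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Counts how many blocks have published their partial sum (hypothetical name).
__device__ unsigned int blocks_done = 0;

__global__ void sum_kernel(const float* in, int n,
                           volatile float* partial, float* out)
{
    __shared__ bool is_last_block;

    if (threadIdx.x == 0) {
        // Thread 0 sums this block's strided slice (serial, to keep the sketch short).
        float s = 0.0f;
        for (int i = blockIdx.x; i < n; i += gridDim.x) s += in[i];

        // (1) Publish the partial sum to global memory.
        partial[blockIdx.x] = s;

        // (2) Device-wide fence: the write above must be observed by threads in
        //     ANY block before the counter update below. __threadfence_block()
        //     would only give that guarantee to threads in this block.
        __threadfence();

        // (3) Signal completion. atomicInc returns the old value, so the block
        //     that reads gridDim.x - 1 is the last one to arrive.
        unsigned int ticket = atomicInc(&blocks_done, gridDim.x);
        is_last_block = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    // The last block to arrive reads every other block's partial sum.
    if (is_last_block && threadIdx.x == 0) {
        float total = 0.0f;
        for (unsigned int b = 0; b < gridDim.x; ++b) total += partial[b];
        *out = total;
        blocks_done = 0;  // reset so the kernel can be launched again
    }
}

// Host helper: sums n ones on the device and returns the result.
float sum_on_device(int n)
{
    const int blocks = 64, threads = 32;  // assumed launch configuration
    std::vector<float> h_in(n, 1.0f);

    float *d_in = nullptr, *d_partial = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    sum_kernel<<<blocks, threads>>>(d_in, n, d_partial, d_out);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_out);
    return result;
}

int main()
{
    const int n = 1 << 16;
    float total = sum_on_device(n);
    printf("sum = %.0f (expected %d)\n", total, n);
    return total == static_cast<float>(n) ? 0 : 1;
}
```

The point of the sketch is step (2): the writer and the reader sit in different blocks, which is exactly the case __threadfence_block() does not cover.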