Threadfence question in B.5 of CUDA Programming Guide

I thought I understood the function of __threadfence_block() but the documentation is throwing me for a loop.

Specifically, B.5 states that in the following code, where thread 1 calls writeXY and thread 2 calls readXY (and both are in the same block),

__device__ volatile int X = 1, Y = 2;

__device__ void writeXY()
{
    X = 10;
    __threadfence_block();
    Y = 20;
}

__device__ void readXY()
{
    int A = X;
    __threadfence_block();
    int B = Y;
}

that for thread 2, “A will always be 10 if B is 20”. But I don’t see how this is guaranteed, given my understanding of how threadfences work. As I understand them, these fences only ensure that the write Y = 20 becomes visible after the write X = 10, and that the read int B = Y is performed after the read int A = X. That doesn’t rule out the following interleaving:
int A=X
X=10
Y=20
int B=Y
leaving A=1, B=20. All threads still observe that Y=20 follows X=10, and the read int B=Y does indeed follow int A=X. What am I missing?

I tried to reproduce this by actually running the example, but I got a deterministic result even with the __threadfence_block() calls removed.
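For concreteness, this is roughly the kind of harness I mean (a minimal sketch, not the guide's code: the kernel, the A/B globals used to publish the reads to the host, and the two-thread launch are my own assumptions):

```cuda
#include <cstdio>

__device__ volatile int X = 1, Y = 2;
__device__ int A = 0, B = 0;  // publish thread 2's reads for host inspection

__device__ void writeXY()
{
    X = 10;
    __threadfence_block();
    Y = 20;
}

__device__ void readXY()
{
    int a = X;
    __threadfence_block();
    int b = Y;
    A = a;
    B = b;
}

__global__ void kernel()
{
    // Thread 0 plays "thread 1" from the guide, thread 1 plays "thread 2".
    if (threadIdx.x == 0)
        writeXY();
    else if (threadIdx.x == 1)
        readXY();
}

int main()
{
    kernel<<<1, 2>>>();
    cudaDeviceSynchronize();
    int hA = 0, hB = 0;
    cudaMemcpyFromSymbol(&hA, A, sizeof(int));
    cudaMemcpyFromSymbol(&hB, B, sizeof(int));
    printf("A = %d, B = %d\n", hA, hB);
    return 0;
}
```

Note that with only two threads both sit in the same warp, so the divergent branches execute serially; that alone could make the outcome deterministic regardless of fences, which may be why removing __threadfence_block() changed nothing in my test.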

Thanks for your help!