What was described in (1) above deals with the re-ordering of instructions that are independent of each other, as expressed by the code itself. The compiler has no notion of the run-time configuration and examines the code assuming it is executed by a single thread. If no data or control dependency is expressed between particular operations in the code, a C++ compiler is free to re-order these operations, including loads and stores.
Reductions in particular often involve cross-thread data dependencies that are not expressible by C++ code alone. Without the addition of explicit fencing or synchronization of some kind, the sequence of loads and stores necessary for proper operation of the reduction code cannot be guaranteed. The use of “volatile” to achieve the desired effect by inhibiting certain compiler optimizations on loads and stores is, in my book, a dirty trick and an abuse of this type qualifier, which the C++ designers did not intend for inter-thread communication. It is in common use, however. The cleaner way, IMHO, is to use appropriate fencing and/or synchronization primitives to enforce the required order of reads and writes.
Beyond this, the description of “weakly ordered” in the book is likely a reference to how the underlying GPU hardware deals with memory, and its description is correct in my view. Naturally, only the authors of the book can provide an authoritative answer as to what exactly they meant here.
My guess as to why no detail was added is that a weakly ordered memory model gives the hardware a lot of latitude in terms of actual behavior, and the details can change from one chip to the next. The Wikipedia article on memory ordering (https://en.wikipedia.org/wiki/Memory_ordering) shows that all kinds of different design choices are possible. For example, the memory controller in the GPU could change the sequence ld1, st1, ld2, st2 into ld1, ld2, st1, st2 to improve memory throughput, since grouping loads with loads and stores with stores reduces read-write turnaround when accessing physical DRAM.