gld_coherent and gst_coherent inconsistent?

I’m testing an extremely simple case: read a large matrix of 16544*480 floats and, for each element, compute its square and write it back to its original place in global memory.
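For reference, the test is roughly of the following shape. This is only a minimal sketch of the case described above, not the exact code that was profiled: the names `squareInPlace` and `squareOnDevice`, the one-thread-per-element mapping, and the block size of 256 are all assumptions made for illustration.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Assumed kernel: each thread squares one element in place,
// i.e. exactly one 4-byte global load and one 4-byte global store.
__global__ void squareInPlace(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];   // global load
        data[i] = v * v;     // global store to the same address
    }
}

// Hypothetical host helper: copies the matrix to the device, runs the
// kernel over all elements, and copies the result back.
// Returns false if any CUDA call fails.
bool squareOnDevice(std::vector<float> &host)
{
    const int n = static_cast<int>(host.size());
    if (n == 0) return true;  // nothing to do; avoids a zero-block launch

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(dev, host.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const int threads = 256;                        // assumed block size
        const int blocks = (n + threads - 1) / threads; // round up
        squareInPlace<<<blocks, threads>>>(dev, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(host.data(), dev, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    return ok;
}
```

With the matrix stored as a flat `std::vector<float>` of 16544*480 = 7941120 elements, calling `squareOnDevice(matrix)` launches 31020 blocks of 256 threads under these assumptions, and every thread issues one load and one store.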
However, CUDA Visual Profiler 1.0 reports gld_coherent at about 63000 and gst_coherent at about 253000.
I’m wondering why there is a 4x gap between gld_coherent and gst_coherent in this simple case, where the number of loads and stores should obviously be equal.

A side question: what do gld_coherent and gst_coherent really mean, and how do the numbers relate to actual memory accesses?