Depends on how much can be done as 128-bit reads. See manual section 6.1.2.1 for how to specify alignment and how accesses from multiple threads to the same memory location are coalesced.
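A minimal sketch of what a 128-bit read can look like in code. The struct name `Vec4` and the kernel name `copyVec4` are made up for illustration and do not come from the manual. The idea is that a 16-byte struct declared with `__align__(16)` may be fetched by the compiler with one 128-bit load instead of four 32-bit ones.

```cuda
#include <cuda_runtime.h>

// Hypothetical 16-byte element. __align__(16) guarantees 16-byte alignment,
// which allows the compiler to load the whole struct with one 128-bit read.
struct __align__(16) Vec4 {
    float x, y, z, w;
};

// Each thread copies one 16-byte element.
__global__ void copyVec4(const Vec4* in, Vec4* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}
```

Whether the wide load is actually faster on your card is something to measure, not to assume.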
float4 fetches are two times slower, and column-order fetches are two orders of magnitude slower (0.9 GB/s instead of 70).
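The two access patterns can be sketched like this. It is an illustration only, not the benchmark the numbers above come from. The kernel names, the helper `elementIndex` and the row-major layout are assumptions made for the example.

```cuda
#include <cuda_runtime.h>

// Assumed layout: row-major matrix, element (row, col) lives at row * width + col.
__host__ __device__ inline int elementIndex(int row, int col, int width)
{
    return row * width + col;
}

// Neighbouring threads read neighbouring addresses (stride 1),
// so their reads can be coalesced.
__global__ void readCoalesced(const float* data, float* out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= width) return;
    float acc = 0.0f;
    for (int row = 0; row < height; ++row)
        acc += data[elementIndex(row, col, width)];
    out[col] = acc;   // one sum per column
}

// Neighbouring threads read addresses `width` floats apart,
// so their reads cannot be coalesced.
__global__ void readStrided(const float* data, float* out, int width, int height)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= height) return;
    float acc = 0.0f;
    for (int col = 0; col < width; ++col)
        acc += data[elementIndex(row, col, width)];
    out[row] = acc;   // one sum per row
}
```

Both kernels touch every element exactly once. Only the order in which neighbouring threads touch memory differs.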
Read speed depends on your grid setup:
A. You need many thread blocks (CTAs), because of global memory alignment (in my sample, read offsets vary as gridDim.x * matrix row size). 1024 blocks is a good initial value.
B. You need many threads per CTA. The thread count SHOULD be a multiple of 32 (192, 256 or 320 are good starting values to try).
C. The optimal thread count per CTA depends on per-thread register usage (reg=NN in the .cubin file). Each multiprocessor has a 32 KB register file (so 8192 floats). For a kernel that uses 12 registers you can run 682 threads on a multiprocessor (the hardwired maximum is 768). So you cannot run three CTAs of 256 threads in parallel, but two CTAs of 320 threads can be executed.
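The arithmetic in point C can be sketched in host code like this. It is a rough model only: it assumes G80-class figures (8192 registers and at most 768 resident threads per multiprocessor) and ignores shared memory, the per-multiprocessor CTA limit and register allocation granularity. The function and constant names are made up for the example.

```cuda
// Assumed G80-class limits; newer GPUs have different values.
const int kRegistersPerSM  = 8192;  // 32 KB register file / 4 bytes per register
const int kMaxThreadsPerSM = 768;   // hardwired limit on resident threads

// Upper bound on resident threads per multiprocessor for a given register count.
int maxResidentThreads(int regsPerThread)
{
    int byRegisters = kRegistersPerSM / regsPerThread;
    return byRegisters < kMaxThreadsPerSM ? byRegisters : kMaxThreadsPerSM;
}

// How many whole CTAs of `threadsPerCta` threads fit under that bound.
int ctasPerSM(int regsPerThread, int threadsPerCta)
{
    return maxResidentThreads(regsPerThread) / threadsPerCta;
}

// Threads that are actually resident once only whole CTAs are counted.
int residentThreads(int regsPerThread, int threadsPerCta)
{
    return ctasPerSM(regsPerThread, threadsPerCta) * threadsPerCta;
}
```

With 12 registers per thread, `residentThreads(12, 256)` gives 512 (two CTAs) and `residentThreads(12, 320)` gives 640 (also two CTAs), so the 320-thread configuration keeps more threads in flight.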
Sorry for my broken (Russian-alike) English. Feel free to ask.