Is a Fermi L1 cache line filled completely before the SPs can access the data?

In CUDA, suppose all 32 threads in a warp access exactly the same address.
Before the cores can access the data, does the hardware wait for the entire 128-byte cache line to be filled? If so, would disabling the L1 cache be faster?
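
To make the question concrete, here is a minimal sketch of how the two cases could be compared. It is only an illustration under a few assumptions of mine: the kernel name `same_address_chase`, the helper `time_same_address_chase`, the 16 MB array size (chosen to exceed the Fermi L1 and L2 sizes), and the single-warp launch are all hypothetical. Every lane holds the same index, so each load is a same-address access, and each step lands on a new 128-byte line so the loads should miss in the cache.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One warp walks a chain of indices. All 32 lanes hold the same idx,
// so every load in the loop is a same-address access across the warp.
__global__ void same_address_chase(const int *next, int steps, int *out)
{
    int idx = 0;
    for (int i = 0; i < steps; ++i)
        idx = next[idx];      // dependent load: the next step needs this value
    out[threadIdx.x] = idx;   // store the result so the loads are not removed
}

// Hypothetical helper: runs the chase once and returns the elapsed time in ms.
float time_same_address_chase(int *final_idx)
{
    const int n = 1 << 22;    // 4M ints = 16 MB (assumed larger than L1 and L2)
    const int stride = 32;    // 32 ints = 128 bytes, one L1 line per step
    const int steps = n / stride;

    // next[i] points one cache line ahead, wrapping around at the end.
    std::vector<int> h(n);
    for (int i = 0; i < n; ++i)
        h[i] = (i + stride) % n;

    int *d_next = 0, *d_out = 0;
    cudaMalloc(&d_next, n * sizeof(int));
    cudaMalloc(&d_out, 32 * sizeof(int));
    cudaMemcpy(d_next, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0);
    same_address_chase<<<1, 32>>>(d_next, steps, d_out);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaMemcpy(final_idx, d_out, sizeof(int), cudaMemcpyDeviceToHost);

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaFree(d_next);
    cudaFree(d_out);
    return ms;
}

int main()
{
    int final_idx = -1;
    float ms = time_same_address_chase(&final_idx);
    printf("elapsed: %f ms (final index %d)\n", ms, final_idx);
    return 0;
}
```

Building this once with the default settings (`nvcc -arch=sm_20 chase.cu`) and once with `nvcc -arch=sm_20 -Xptxas -dlcm=cg chase.cu`, which makes global loads bypass L1 and be cached in L2 only, should show whether the L1 line fill adds measurable latency for this access pattern. The file name `chase.cu` is a placeholder, and a single cold pass like this is a rough measurement at best.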