Performance question, need some explanation

Card: GTX 580

Grid: 2x2x256

Can somebody please explain why the second version of the kernel runs twice as fast as the first one?

Version 1:

Version 2:

As far as I understand, both dataTables should reside in global memory. What has changed?

The copy to local memory optimizes the data layout. In the first variant, reading any word from *dataTable requires a full cache line to be read from memory for each thread (neglecting the initial copy), so one access by a warp of 32 threads touches up to 32 cache lines. In the second variant, reading a single cache line is sufficient for the whole warp, reducing the required bandwidth to 1/32.
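To make the 1/32 figure concrete, here is a small host-side C++ model that counts how many cache lines one warp touches under each layout. It is only a sketch, not the kernels from the question (those are not shown here). It assumes a warp of 32 threads, 128-byte cache lines, 4-byte words, and that in the first variant each thread's table is stored contiguously and is at least one cache line long. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Assumed hardware parameters (not taken from the original post).
constexpr std::size_t kWarpSize = 32;         // threads per warp
constexpr std::size_t kCacheLineBytes = 128;  // bytes per cache line
constexpr std::size_t kWordBytes = 4;         // one 32-bit word

// Hypothetical layout for the first variant: each thread owns a contiguous
// table of `tableWords` words, so thread t finds its word k at word index
// t * tableWords + k. Returns the number of distinct cache lines touched
// when all threads of one warp read their own word k.
std::size_t linesTouchedContiguous(std::size_t tableWords, std::size_t k) {
    assert(k < tableWords);
    std::set<std::size_t> lines;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        std::size_t byteAddr = (t * tableWords + k) * kWordBytes;
        lines.insert(byteAddr / kCacheLineBytes);
    }
    return lines.size();
}

// Interleaved layout, the way local memory is organized: word k of
// consecutive threads sits in consecutive 32-bit words, i.e. at word
// index k * kWarpSize + t.
std::size_t linesTouchedInterleaved(std::size_t tableWords, std::size_t k) {
    assert(k < tableWords);
    std::set<std::size_t> lines;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        std::size_t byteAddr = (k * kWarpSize + t) * kWordBytes;
        lines.insert(byteAddr / kCacheLineBytes);
    }
    return lines.size();
}
```

With a 32-word table per thread and k = 5, linesTouchedContiguous(32, 5) returns 32 and linesTouchedInterleaved(32, 5) returns 1, which is the factor of 32 described above. For tables shorter than one cache line, neighbouring threads share lines in the contiguous layout and the gap is smaller.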

Thank you.