Card: GTX 580
Grid: 2x2x256
Can somebody please explain why the second version of the kernel runs twice as fast as the first one?
Version 1:
Version 2:
As far as I understand, both dataTables should reside in global memory. What has changed?
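The copy to local memory optimizes the data layout. In the first variant, reading any word from *dataTable requires a full cache line to be read from memory for each thread (neglecting the initial copy), so a warp of 32 threads pulls in 32 cache lines per access. In the second variant, reading a single cache line is sufficient for the whole warp, because local memory is interleaved so that the same word of consecutive threads sits at consecutive addresses. This reduces the required bandwidth to 1/32.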
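To make the 1/32 figure concrete, here is a rough host-side model in plain C++ (not the poster's kernels, which are not shown in this thread). It counts how many distinct cache lines one warp touches when every thread reads the same word index of its own table. The 128-byte line size, the 4-byte word, the 256-word table and all function names are assumptions chosen for illustration only.

```cpp
#include <cstddef>
#include <set>

// Assumed parameters for a Fermi-class card: 32 threads per warp,
// 128-byte cache lines, 32-bit table entries.
constexpr std::size_t kWarpSize = 32;
constexpr std::size_t kCacheLineBytes = 128;
constexpr std::size_t kWordBytes = 4;

// Layout A (hypothetical stand-in for Version 1): per-thread tables stored
// back to back, so thread t owns words [t * tableWords, (t + 1) * tableWords).
std::size_t strided_address(std::size_t thread, std::size_t word,
                            std::size_t tableWords) {
    return (thread * tableWords + word) * kWordBytes;
}

// Layout B (local-memory style): word k of all threads in the warp is
// stored contiguously, so neighbouring threads hit neighbouring addresses.
std::size_t interleaved_address(std::size_t thread, std::size_t word) {
    return (word * kWarpSize + thread) * kWordBytes;
}

// Count the distinct cache lines touched when each thread of one warp
// issues a single read at the address returned by `address(thread)`.
template <typename AddressFn>
std::size_t cache_lines_touched(AddressFn address) {
    std::set<std::size_t> lines;
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        lines.insert(address(t) / kCacheLineBytes);
    }
    return lines.size();
}

std::size_t lines_strided(std::size_t word, std::size_t tableWords) {
    return cache_lines_touched(
        [=](std::size_t t) { return strided_address(t, word, tableWords); });
}

std::size_t lines_interleaved(std::size_t word) {
    return cache_lines_touched(
        [=](std::size_t t) { return interleaved_address(t, word); });
}
```

With a 256-word table per thread, `lines_strided(5, 256)` counts 32 cache lines while `lines_interleaved(5)` counts 1, which is where the factor of 32 comes from. The model ignores caching between successive reads, so the real speedup can be smaller, as the measured factor of two suggests.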
Thank you.