Image convolution question using large blocks of shared memory

Hello

I've been doing image convolution with a 3x3 filter.

My first approach was to process one row per block, loading the upper, middle and lower rows into shared memory to calculate one row of the final image.

This worked, but thinking about it in more detail, it is very inefficient, since every pixel (except the apron) is read 3 times from global memory
(once as the upper pixel, once as the current pixel, and once as the lower pixel).
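To make the "read 3 times" count concrete, here is a minimal sketch of the arithmetic in plain host-side C++. The helper names `timesRowIsLoaded` and `totalRowSchemeLoads` are hypothetical, and the sketch assumes that only interior rows (1 to height-2) produce output and that the block for output row r loads rows r-1, r and r+1 in full. It only counts global loads and says nothing about timing.

```cpp
// Hypothetical helper: how many blocks load image row `row` when the block
// for output row `out` loads rows out-1, out, out+1.
// Assumption: only interior rows 1..height-2 produce output.
int timesRowIsLoaded(int row, int height) {
    int count = 0;
    for (int out = 1; out <= height - 2; ++out) {
        if (row >= out - 1 && row <= out + 1) {
            ++count;
        }
    }
    return count;
}

// Total number of pixel loads from global memory over the whole image.
long long totalRowSchemeLoads(int width, int height) {
    long long total = 0;
    for (int row = 0; row < height; ++row) {
        total += static_cast<long long>(width) * timesRowIsLoaded(row, height);
    }
    return total;
}
```

For a 512x512 image, `timesRowIsLoaded` gives 3 for rows well inside the image and fewer only for the two rows nearest each border, so the total comes to 3 loads per output row.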

So I decided to enlarge my blocks to the maximum of 512 threads, using 32x16 blocks, which also meets half-warp alignment.

So my shared memory array is 32x16 elements, every thread reads one pixel, and all threads except those on the apron pixels calculate an output
(blocks have to overlap on the aprons, of course).
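The load count for this tiled layout can be sketched the same way in host-side C++. The helper names `outputsPerTile` and `loadsPerOutput` are hypothetical, and the sketch assumes a one-pixel apron on every side (filter radius 1 for a 3x3 filter) and one global load per thread. It is only a count of loads, not a performance model.

```cpp
// Hypothetical helper: number of output pixels one tile produces when
// threads within `radius` of the tile edge only load and do not compute.
int outputsPerTile(int tileW, int tileH, int radius) {
    return (tileW - 2 * radius) * (tileH - 2 * radius);
}

// Global loads per output pixel, assuming every thread loads exactly one pixel.
double loadsPerOutput(int tileW, int tileH, int radius) {
    return static_cast<double>(tileW * tileH) /
           outputsPerTile(tileW, tileH, radius);
}
```

With `tileW = 32`, `tileH = 16` and `radius = 1`, each block loads 512 pixels and writes 30 x 14 = 420 outputs, so roughly 1.22 loads per output, and 92 of the 512 threads only load an apron pixel.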

The strange thing is: this method takes just the same time as the one above, although repeated reads of pixels from global memory are now restricted to the block aprons only!

I'm confused, how can this be?

I hope someone knows an answer!

Thanks a lot, best regards

Maz