random access isn’t necessarily “random”; it just isn’t linear (and it isn’t “square” or “cubic” either). “random” in this context really only means non-regular, and more specifically access that doesn’t line up with the memory architecture.
in your case the access is pretty regular, so it shouldn’t be too difficult to make it line up well. since you have a 5x5 filter, presumably you’re reading each pixel 5x5=25 times from global memory. in that case i’d recommend reading sections of the image into shared memory and working on them there. if you’re limited by memory bandwidth, that could get you something approaching a 25x speedup (a bit less in practice, since neighbouring blocks re-read each other’s border pixels).
also store the 5x5 filter in constant memory. it’s cached, and since every thread reads the same coefficient at the same time it should be about as fast as shared memory, without using any shared memory.
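a minimal sketch of the constant memory part, in case it helps. the names `d_filter`, `upload_filter` and `filter_naive` are made up for illustration, and clamping at the image border is my assumption (your border handling may differ). the kernel is deliberately the naive one that still reads the image from global memory, so only the filter storage changes here.

```cuda
#include <cuda_runtime.h>

#define FILTER_W 5  // 5x5 filter, as in the post

// filter coefficients in constant memory (name is hypothetical)
__constant__ float d_filter[FILTER_W * FILTER_W];

// hypothetical host helper: copy the 25 coefficients to the constant symbol
cudaError_t upload_filter(const float *h_filter)
{
    return cudaMemcpyToSymbol(d_filter, h_filter,
                              FILTER_W * FILTER_W * sizeof(float));
}

// naive kernel: one thread per output pixel, image read from global memory
__global__ void filter_naive(const float *in, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int fy = 0; fy < FILTER_W; ++fy) {
        for (int fx = 0; fx < FILTER_W; ++fx) {
            // assumption: clamp out-of-range coordinates to the image edge
            int ix = min(max(x + fx - FILTER_W / 2, 0), width - 1);
            int iy = min(max(y + fy - FILTER_W / 2, 0), height - 1);
            // every thread reads the same d_filter element in the same
            // iteration, which is the access pattern constant memory likes
            sum += d_filter[fy * FILTER_W + fx] * in[iy * width + ix];
        }
    }
    out[y * width + x] = sum;
}
```

call `upload_filter` once before launching; the coefficients stay resident until you overwrite them.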
and i believe the speedup from texture memory (besides free interpolation) comes when you access it linearly or in a blockwise (2-d) fashion. if you’re copying into shared memory you can definitely read the image in linearly. maybe in each block read in (x-2,y-2) to (x+w+2,y+h+2) and then run the filter in parallel from (x,y) to (x+w,y+h).
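here’s a rough sketch of that tiling scheme, not a tuned implementation. assumptions that are mine rather than yours: a 16x16 output tile per block (w = h = 16), float pixels in a row-major array, clamping at the image border, and the made-up names `filter_tiled`, `run_filter_tiled` and `d_filter`. each block first loads its (w+4) x (h+4) padded region into shared memory, walking it in row-major order so neighbouring threads mostly touch neighbouring addresses, then every thread filters one pixel out of the shared tile.

```cuda
#include <cuda_runtime.h>

#define RADIUS   2                      // 5x5 filter -> 2-pixel border
#define FILTER_W (2 * RADIUS + 1)
#define TILE_W   16                     // "w" in the post (assumed value)
#define TILE_H   16                     // "h" in the post (assumed value)
#define SMEM_W   (TILE_W + 2 * RADIUS)  // padded tile width
#define SMEM_H   (TILE_H + 2 * RADIUS)  // padded tile height

__constant__ float d_filter[FILTER_W * FILTER_W];

// must be launched with blockDim = (TILE_W, TILE_H)
__global__ void filter_tiled(const float *in, float *out, int width, int height)
{
    __shared__ float tile[SMEM_H][SMEM_W];

    // (x0, y0) is the top-left output pixel of this block
    int x0 = blockIdx.x * TILE_W;
    int y0 = blockIdx.y * TILE_H;

    // cooperative load of (x0-2, y0-2) .. (x0+w+2, y0+h+2), row by row
    int tid = threadIdx.y * TILE_W + threadIdx.x;
    for (int i = tid; i < SMEM_W * SMEM_H; i += TILE_W * TILE_H) {
        int sx = i % SMEM_W;
        int sy = i / SMEM_W;
        // assumption: clamp out-of-range coordinates to the image edge
        int gx = min(max(x0 + sx - RADIUS, 0), width - 1);
        int gy = min(max(y0 + sy - RADIUS, 0), height - 1);
        tile[sy][sx] = in[gy * width + gx];
    }
    __syncthreads();  // the whole tile must be loaded before anyone filters

    int x = x0 + threadIdx.x;
    int y = y0 + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int fy = 0; fy < FILTER_W; ++fy)
        for (int fx = 0; fx < FILTER_W; ++fx)
            sum += d_filter[fy * FILTER_W + fx] *
                   tile[threadIdx.y + fy][threadIdx.x + fx];
    out[y * width + x] = sum;
}

// hypothetical host launcher: uploads the filter, runs the kernel, waits
cudaError_t run_filter_tiled(const float *d_in, float *d_out,
                             int width, int height, const float *h_filter)
{
    cudaError_t err = cudaMemcpyToSymbol(
        d_filter, h_filter, FILTER_W * FILTER_W * sizeof(float));
    if (err != cudaSuccess) return err;

    dim3 block(TILE_W, TILE_H);
    dim3 grid((width + TILE_W - 1) / TILE_W, (height + TILE_H - 1) / TILE_H);
    filter_tiled<<<grid, block>>>(d_in, d_out, width, height);
    err = cudaGetLastError();
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();
}
```

the load loop runs once or twice per thread (400 padded pixels over 256 threads), and after the `__syncthreads()` all 25 reads per output pixel hit shared memory instead of global memory.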
an advantage of this is that varying your filter size will have a fairly negligible effect on your i/o utilization. it’s also deliberate and exact, whereas caching is more heuristic and might make the wrong decisions.
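to put rough numbers on the i/o claim: assume each block produces a w x h tile of output and the filter has radius r (r = 2 for 5x5), and ignore any caching in the naive version. the 16x16 tile is just an example value i picked.

```latex
\text{global reads per output pixel (tiled)} = \frac{(w + 2r)(h + 2r)}{w\,h}

w = h = 16,\ r = 2\ (5 \times 5):\quad \frac{20 \cdot 20}{16 \cdot 16} = \frac{400}{256} \approx 1.56 \quad \text{vs. } 25 \text{ naive}

w = h = 16,\ r = 3\ (7 \times 7):\quad \frac{22 \cdot 22}{16 \cdot 16} = \frac{484}{256} \approx 1.89 \quad \text{vs. } 49 \text{ naive}
```

so going from 25 to 49 taps nearly doubles the naive global reads but only adds about 21% to the tiled ones, and with this tile size the 5x5 case does about 16x fewer global reads than the naive version rather than the full 25x.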