I have a kernel where each thread has to read some data from an image. The threads do not necessarily need the same pixel, and the pixels are not accessed in any particular order, so I think the only possible option here is global memory. The problem is that global memory is really slow, and there is not enough computation in the kernel to hide the latency of each read. On the other hand, there is not enough shared memory to copy the image into it, as in the matrix product example (one pixel per thread or something similar).
You could use texture memory (reads are cached, and accesses to nearby coordinates in 2D are faster, too). I have never used it, so I cannot tell you how much faster it is than global memory access. However, it seems that you cannot write to texture memory from the device; your kernel can only read from it.
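To make the suggestion concrete, here is a minimal sketch of scattered reads through a texture, assuming a single-channel float image and the texture object API of the CUDA runtime. The names `gatherPixels` and `gatherThroughTexture` and the coordinate list are hypothetical, made up for illustration, and error checking is left out for brevity. I have not measured it against plain global memory reads, so treat it as a starting point rather than a guaranteed speedup.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread reads one pixel at an arbitrary (x, y)
// position through the texture cache instead of a plain global memory load.
__global__ void gatherPixels(cudaTextureObject_t tex, const int2* coords,
                             float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Unnormalized coordinates with point filtering: +0.5f addresses the texel center.
        out[i] = tex2D<float>(tex, coords[i].x + 0.5f, coords[i].y + 0.5f);
    }
}

// Hypothetical host helper: uploads the image to a CUDA array, binds it to a
// texture object, launches the kernel, and returns the gathered values.
std::vector<float> gatherThroughTexture(const std::vector<float>& image,
                                        int width, int height,
                                        const std::vector<int2>& coords)
{
    int n = static_cast<int>(coords.size());

    // Copy the image into a CUDA array, the layout the texture unit reads from.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t array = nullptr;
    cudaMallocArray(&array, &desc, width, height);
    cudaMemcpy2DToArray(array, 0, 0, image.data(), width * sizeof(float),
                        width * sizeof(float), height, cudaMemcpyHostToDevice);

    // Describe the resource (the array) and how it should be sampled.
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = array;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;   // clamp out-of-range x
    texDesc.addressMode[1] = cudaAddressModeClamp;   // clamp out-of-range y
    texDesc.filterMode = cudaFilterModePoint;        // no interpolation
    texDesc.readMode = cudaReadModeElementType;      // return raw float values
    texDesc.normalizedCoords = 0;                    // coordinates in pixels

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);

    // Upload the per-thread coordinates and allocate the output buffer.
    int2* dCoords = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dCoords, n * sizeof(int2));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dCoords, coords.data(), n * sizeof(int2), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    gatherPixels<<<blocks, threads>>>(tex, dCoords, dOut, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    std::vector<float> result(n);
    cudaMemcpy(result.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFreeArray(array);
    cudaFree(dCoords);
    cudaFree(dOut);
    return result;
}
```

You would call `gatherThroughTexture(image, width, height, coords)` with the image stored row by row and one `int2` per thread. In a real kernel you would create the texture object once and reuse it across launches instead of rebuilding it on every call.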