I have a kernel in which each thread has to read some data from an image. The threads do not necessarily need the same pixel, and the pixels they need are not in any specific order, so I think the only possible option here is global memory. The problem is that global memory is really slow, and there are not enough threads in the kernel to hide the latency of each read. On the other hand, there is not enough shared memory to copy the image into it, as in the matrix multiplication example (one pixel per thread or something similar).
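To make the access pattern concrete, here is a minimal sketch of the situation described above. It is not the actual code: the kernel name `gatherPixels`, the index array `idx`, the helper `runGatherCheck`, and the image size are all assumptions chosen for illustration.

```cuda
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Each thread reads one pixel at an arbitrary position.
// Assumption: the positions come from a precomputed index array `idx`.
__global__ void gatherPixels(const unsigned char* image, const int* idx,
                             unsigned char* out, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) {
        // Neighbouring threads may hit unrelated addresses, so this
        // global-memory load is generally not coalesced.
        out[t] = image[idx[t]];
    }
}

// Hypothetical host helper: runs the kernel on random indices and returns
// the number of results that differ from a CPU reference (0 if all match).
int runGatherCheck(int n)
{
    const int width = 640, height = 480;  // illustrative image size
    std::vector<unsigned char> hImage(width * height);
    std::vector<int> hIdx(n);
    for (size_t i = 0; i < hImage.size(); ++i)
        hImage[i] = static_cast<unsigned char>(i % 256);
    for (int i = 0; i < n; ++i)
        hIdx[i] = rand() % (width * height);

    unsigned char* dImage = nullptr;
    unsigned char* dOut = nullptr;
    int* dIdx = nullptr;
    cudaMalloc(&dImage, hImage.size());
    cudaMalloc(&dIdx, n * sizeof(int));
    cudaMalloc(&dOut, n);
    cudaMemcpy(dImage, hImage.data(), hImage.size(), cudaMemcpyHostToDevice);
    cudaMemcpy(dIdx, hIdx.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const int block = 256;
    gatherPixels<<<(n + block - 1) / block, block>>>(dImage, dIdx, dOut, n);

    std::vector<unsigned char> hOut(n);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(hOut.data(), dOut, n, cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hOut[i] != hImage[hIdx[i]]) ++mismatches;

    cudaFree(dImage);
    cudaFree(dIdx);
    cudaFree(dOut);
    return mismatches;
}
```

Calling `runGatherCheck(1024)` from `main` only checks that the gather is correct; it says nothing about speed, which is the real concern here.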
What do you think would be the best option?