I have an n x m matrix in which each element is a float4 value. The algorithm processes the matrix first horizontally and then vertically:
In the first kernel, each thread reads its left and right neighbors, performs some computation, and writes the result to another array.
In the second kernel, each thread reads its upper and lower neighbors, performs some computation, and writes the result to another array.
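To make the access pattern concrete, here is a minimal sketch of the two passes using plain global memory. Several things in it are assumptions for illustration only: the matrix is stored row-major, borders are clamped, the vertical pass consumes the output of the horizontal pass, and `combine` is a hypothetical stand-in for the real computation. `runBothPasses` is likewise just an illustrative host helper.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Hypothetical placeholder for the real per-element computation:
// the component-wise sum of the two neighbors.
__device__ __forceinline__ float4 combine(float4 a, float4 b)
{
    return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}

// Pass 1: each thread reads its left and right neighbors in the same row.
// Assumptions: row-major storage, n rows by m columns, clamped borders.
__global__ void horizontalPass(const float4* in, float4* out, int n, int m)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= n || col >= m) return;

    int left  = max(col - 1, 0);
    int right = min(col + 1, m - 1);
    out[row * m + col] = combine(in[row * m + left], in[row * m + right]);
}

// Pass 2: each thread reads its upper and lower neighbors in the same column.
__global__ void verticalPass(const float4* in, float4* out, int n, int m)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= n || col >= m) return;

    int up   = max(row - 1, 0);
    int down = min(row + 1, n - 1);
    out[row * m + col] = combine(in[up * m + col], in[down * m + col]);
}

// Illustrative host helper: copies the input to the device, runs both
// passes (the second reads the output of the first), and copies back.
std::vector<float4> runBothPasses(const std::vector<float4>& h_in, int n, int m)
{
    assert(h_in.size() == static_cast<size_t>(n) * m);
    size_t bytes = h_in.size() * sizeof(float4);

    float4 *d_in, *d_tmp, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_tmp, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(32, 8);
    dim3 grid((m + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    horizontalPass<<<grid, block>>>(d_in, d_tmp, n, m);
    verticalPass<<<grid, block>>>(d_tmp, d_out, n, m);

    std::vector<float4> h_out(h_in.size());
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_tmp);
    cudaFree(d_out);
    return h_out;
}
```

In the horizontal pass, adjacent threads in a warp read adjacent float4 elements, so every input element is requested by two threads of the same row. In the vertical pass, the two reads of a thread come from different rows, but the threads of a warp still touch consecutive addresses within each row.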
This figure shows the two passes graphically:
The application is computationally intensive. I know that shared memory is a better option than texture memory when there are many neighbor reads, since the larger the radius, the more data reuse shared memory can exploit. However, I'm not sure whether this also holds with only two reads per thread, or whether I would get better performance using a texture. Do you recommend texture or shared memory?
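For comparison, here is a rough sketch of the two options for the horizontal pass, so that both can be timed on the target GPU. It rests on assumptions that are not part of the real application: row-major storage, clamped borders, one block row per matrix row (so n must fit in the grid's y dimension), a hypothetical `combine` function, an arbitrary `TILE` size, and the texture object API bound to linear memory. `runHorizontal` is only an illustrative host helper.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstring>
#include <vector>

#define TILE 128  // threads per block along a row (arbitrary illustrative value)

// Hypothetical placeholder for the real per-element computation.
__device__ __forceinline__ float4 combine(float4 a, float4 b)
{
    return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}

// Variant A: shared-memory tile with a one-element halo on each side.
__global__ void horizontalShared(const float4* in, float4* out, int n, int m)
{
    __shared__ float4 tile[TILE + 2];
    int row = blockIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int t   = threadIdx.x + 1;  // position inside the tile, after the left halo

    // Every thread loads one element; the column is clamped so that
    // threads past the end of the row still read a valid address.
    tile[t] = in[row * m + min(col, m - 1)];
    if (threadIdx.x == 0)        tile[0]        = in[row * m + max(col - 1, 0)];
    if (threadIdx.x == TILE - 1) tile[TILE + 1] = in[row * m + min(col + 1, m - 1)];
    __syncthreads();

    if (col < m) out[row * m + col] = combine(tile[t - 1], tile[t + 1]);
}

// Variant B: neighbor reads go through the texture path.
__global__ void horizontalTexture(cudaTextureObject_t tex, float4* out, int n, int m)
{
    int row = blockIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= m) return;

    float4 l = tex1Dfetch<float4>(tex, row * m + max(col - 1, 0));
    float4 r = tex1Dfetch<float4>(tex, row * m + min(col + 1, m - 1));
    out[row * m + col] = combine(l, r);
}

// Illustrative host helper that runs one of the two variants.
std::vector<float4> runHorizontal(const std::vector<float4>& h_in, int n, int m,
                                  bool useTexture)
{
    assert(h_in.size() == static_cast<size_t>(n) * m);
    size_t bytes = h_in.size() * sizeof(float4);

    float4 *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE);
    dim3 grid((m + TILE - 1) / TILE, n);
    if (useTexture) {
        // Bind the linear device buffer to a texture object.
        cudaResourceDesc res;
        memset(&res, 0, sizeof(res));
        res.resType = cudaResourceTypeLinear;
        res.res.linear.devPtr = d_in;
        res.res.linear.desc = cudaCreateChannelDesc<float4>();
        res.res.linear.sizeInBytes = bytes;

        cudaTextureDesc td;
        memset(&td, 0, sizeof(td));
        td.readMode = cudaReadModeElementType;

        cudaTextureObject_t tex = 0;
        cudaCreateTextureObject(&tex, &res, &td, nullptr);
        horizontalTexture<<<grid, block>>>(tex, d_out, n, m);
        cudaDeviceSynchronize();
        cudaDestroyTextureObject(tex);
    } else {
        horizontalShared<<<grid, block>>>(d_in, d_out, n, m);
    }

    std::vector<float4> h_out(h_in.size());
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

With a radius of one, the shared-memory variant replaces two global reads per thread with roughly one global read plus two shared reads and a barrier, so the reuse it buys is small. Which variant is faster likely depends on the GPU generation and its caches, which is why timing both seems the safest way to decide.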