Best option with very few neighbor reads: shared or texture memory?


I have an n x m matrix where each element is a float4 value. The algorithm processes the matrix first horizontally and then vertically:

    In the first kernel, each thread reads its left and right positions, performs some computation and writes the results in another array.

    In the second kernel, each thread reads its upper and lower positions, performs some computation and writes the results in another array.

This figure shows it graphically:

The application is computationally intensive. I know that shared memory is a better option than texture memory when there are many neighbor reads, since the larger the radius, the more reuse shared memory can exploit. However, I’m not sure whether this still holds with only two neighbor reads per thread, or whether I would get better performance using a texture. Do you recommend texture or shared memory?
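For concreteness, the two candidates for the horizontal pass might look like the sketch below. It is only an illustration under stated assumptions: `combine` is a hypothetical stand-in for the real per-element computation, `BLOCK` is an arbitrary block width, neighbors outside the matrix are treated as zero, and the names `horizShared`, `horizTex` and `maxDiffSharedVsTexture` are made up for this example. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

#define BLOCK 128  // threads per block along a row (assumption)

// Hypothetical placeholder for the real computation: component-wise sum.
__device__ float4 combine(float4 l, float4 c, float4 r) {
    return make_float4(l.x + c.x + r.x, l.y + c.y + r.y,
                       l.z + c.z + r.z, l.w + c.w + r.w);
}

// Shared-memory variant: stage BLOCK elements plus a one-element halo per side.
__global__ void horizShared(const float4* in, float4* out, int m) {
    __shared__ float4 tile[BLOCK + 2];
    const float4 zero = make_float4(0.f, 0.f, 0.f, 0.f);
    int row = blockIdx.y;
    int col = blockIdx.x * BLOCK + threadIdx.x;
    int t = threadIdx.x + 1;  // slot 0 and slot BLOCK + 1 hold the halo
    const float4* rowPtr = in + (size_t)row * m;
    tile[t] = (col < m) ? rowPtr[col] : zero;
    if (threadIdx.x == 0)
        tile[0] = (col > 0) ? rowPtr[col - 1] : zero;
    if (threadIdx.x == BLOCK - 1)
        tile[BLOCK + 1] = (col + 1 < m) ? rowPtr[col + 1] : zero;
    __syncthreads();
    if (col < m)
        out[(size_t)row * m + col] = combine(tile[t - 1], tile[t], tile[t + 1]);
}

// Texture variant: fetch the neighbors through a 1D linear texture object.
__global__ void horizTex(cudaTextureObject_t tex, float4* out, int m) {
    const float4 zero = make_float4(0.f, 0.f, 0.f, 0.f);
    int row = blockIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= m) return;
    int i = row * m + col;
    float4 l = (col > 0) ? tex1Dfetch<float4>(tex, i - 1) : zero;
    float4 c = tex1Dfetch<float4>(tex, i);
    float4 r = (col + 1 < m) ? tex1Dfetch<float4>(tex, i + 1) : zero;
    out[i] = combine(l, c, r);
}

// Runs both variants on the same input and returns the largest difference.
float maxDiffSharedVsTexture(int n, int m) {
    size_t count = (size_t)n * m, bytes = count * sizeof(float4);
    std::vector<float4> h(count), a(count), b(count);
    for (size_t i = 0; i < count; ++i) {
        float v = float(i % 17);
        h[i] = make_float4(v, v + 1.f, v + 2.f, v + 3.f);
    }
    float4 *dIn, *dA, *dB;
    cudaMalloc(&dIn, bytes); cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes);
    cudaMemcpy(dIn, h.data(), bytes, cudaMemcpyHostToDevice);

    // Bind the input buffer to a texture object (linear memory, no filtering).
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = dIn;
    res.res.linear.desc = cudaCreateChannelDesc<float4>();
    res.res.linear.sizeInBytes = bytes;
    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;
    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    dim3 grid((m + BLOCK - 1) / BLOCK, n);
    horizShared<<<grid, BLOCK>>>(dIn, dA, m);
    horizTex<<<grid, BLOCK>>>(tex, dB, m);
    cudaMemcpy(a.data(), dA, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(b.data(), dB, bytes, cudaMemcpyDeviceToHost);

    const float* pa = reinterpret_cast<const float*>(a.data());
    const float* pb = reinterpret_cast<const float*>(b.data());
    float maxDiff = 0.f;
    for (size_t i = 0; i < 4 * count; ++i)
        maxDiff = std::fmax(maxDiff, std::fabs(pa[i] - pb[i]));

    cudaDestroyTextureObject(tex);
    cudaFree(dIn); cudaFree(dA); cudaFree(dB);
    return maxDiff;
}
```

Calling `maxDiffSharedVsTexture(64, 300)` checks that the two variants agree before any timing is done. The vertical pass would follow the same pattern with a stride of `m` between neighbors.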


Answer 1: If you are compute-bound, it likely won’t matter much, so do whatever is simpler.

Answer 2: It isn’t clear which is faster, so try both and pick the one with better performance.

Answer 3: A third option may work well: small square tiles, so that you can do both operations in the same kernel, likely with a one-element “apron” of duplicated work around each of the four edges. This may help with efficiency in other ways as well, especially by reducing kernel launch overhead and idle-SM losses. It may also scale better if you ever want to go multi-GPU, or have more data than fits in device memory, since the data stays in such local patches.
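A minimal sketch of that tiled, single-kernel idea is shown below, under several assumptions that are not in the thread: the per-element computation is replaced by a hypothetical component-wise sum (`add3`), values outside the matrix are treated as zero, and `TILE`, `loadOrZero`, `fusedTile` and `fusedMatchesTwoPass` are names invented for the example. Each block has two extra rows of threads that recompute the horizontal result for the apron rows, which is the duplicated work mentioned above. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

#define TILE 16  // output tile is TILE x TILE (assumption)

// Hypothetical stand-in for the real computation: component-wise sum.
__host__ __device__ float4 add3(float4 a, float4 b, float4 c) {
    return make_float4(a.x + b.x + c.x, a.y + b.y + c.y,
                       a.z + b.z + c.z, a.w + b.w + c.w);
}

// Reads element (r, c), or zero when it lies outside the n x m matrix.
__host__ __device__ float4 loadOrZero(const float4* p, int r, int c, int n, int m) {
    return (r >= 0 && r < n && c >= 0 && c < m) ? p[(size_t)r * m + c]
                                                : make_float4(0.f, 0.f, 0.f, 0.f);
}

// Block is TILE x (TILE + 2) threads: rows ty == 0 and ty == TILE + 1 are the apron.
__global__ void fusedTile(const float4* in, float4* out, int n, int m) {
    __shared__ float4 src[TILE + 2][TILE + 2];  // input tile plus apron
    __shared__ float4 hor[TILE + 2][TILE];      // horizontal-pass results
    int tx = threadIdx.x, ty = threadIdx.y;
    int col = blockIdx.x * TILE + tx;
    int row = blockIdx.y * TILE + ty - 1;

    src[ty][tx + 1] = loadOrZero(in, row, col, n, m);
    if (tx == 0)        src[ty][0]        = loadOrZero(in, row, col - 1, n, m);
    if (tx == TILE - 1) src[ty][TILE + 1] = loadOrZero(in, row, col + 1, n, m);
    __syncthreads();

    // Horizontal pass, including the apron rows (duplicated across blocks).
    bool inside = row >= 0 && row < n && col < m;
    hor[ty][tx] = inside ? add3(src[ty][tx], src[ty][tx + 1], src[ty][tx + 2])
                         : make_float4(0.f, 0.f, 0.f, 0.f);
    __syncthreads();

    // Vertical pass: only the interior rows of the block write output.
    if (ty >= 1 && ty <= TILE && inside)
        out[(size_t)row * m + col] = add3(hor[ty - 1][tx], hor[ty][tx], hor[ty + 1][tx]);
}

// Compares the fused kernel against a CPU horizontal-then-vertical reference.
bool fusedMatchesTwoPass(int n, int m) {
    size_t count = (size_t)n * m, bytes = count * sizeof(float4);
    std::vector<float4> h(count), tmp(count), ref(count), got(count);
    for (size_t i = 0; i < count; ++i) {
        float v = float(i % 7);  // small integers keep float sums exact
        h[i] = make_float4(v, v + 1.f, v + 2.f, v + 3.f);
    }
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < m; ++c)
            tmp[(size_t)r * m + c] = add3(loadOrZero(h.data(), r, c - 1, n, m),
                                          loadOrZero(h.data(), r, c, n, m),
                                          loadOrZero(h.data(), r, c + 1, n, m));
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < m; ++c)
            ref[(size_t)r * m + c] = add3(loadOrZero(tmp.data(), r - 1, c, n, m),
                                          loadOrZero(tmp.data(), r, c, n, m),
                                          loadOrZero(tmp.data(), r + 1, c, n, m));

    float4 *dIn, *dOut;
    cudaMalloc(&dIn, bytes); cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, h.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE + 2);
    dim3 grid((m + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    fusedTile<<<grid, block>>>(dIn, dOut, n, m);
    cudaMemcpy(got.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn); cudaFree(dOut);
    return std::memcmp(ref.data(), got.data(), bytes) == 0;
}
```

A call such as `fusedMatchesTwoPass(50, 70)` exercises partial tiles on both edges. Whether the fused kernel actually beats two separate passes depends on how expensive the duplicated apron work is relative to the saved global-memory round trip, so it is worth timing rather than assuming.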

Thanks for your comments! So if it isn’t clear, I’ll have to implement both and pick the better one.

There are multiple answers because it really depends on your compute intensity. You say it’s computationally intensive… if so, then the memory access will be hidden and therefore your access method is irrelevant.

The question is how much computation is enough to hide the memory latency and bandwidth. As a guess, likely something like 20 FLOPs or more per element. If you’re at 5 FLOPs per element, it’s clearly memory bound. At 50 FLOPs per element, it’s clearly compute bound.
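One way to turn that guess into a number for a specific card is a roofline-style estimate. The symbols here are assumptions introduced for illustration, not something from the thread: $P$ is the peak arithmetic throughput in FLOP/s, $B$ the sustained memory bandwidth in bytes/s, and $b$ the bytes moved to and from global memory per element (for example, 32 bytes if each float4 is read once and written once, with neighbor reads served from cache or shared memory).

```latex
% Time per element spent on memory vs. on arithmetic, for F FLOPs per element:
%   t_mem = b / B,   t_compute = F / P
% The kernel is compute bound when t_compute > t_mem, which gives the break-even point
\[
  F^{*} = \frac{P \, b}{B},
  \qquad
  F > F^{*} \;\Rightarrow\; \text{compute bound},
  \qquad
  F < F^{*} \;\Rightarrow\; \text{memory bound}.
\]
```

Since $F^{*}$ scales with the ratio $P/B$, it differs from one GPU to the next, so the thresholds above are best read as rough guesses to be checked against your own device.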

CUDA is great (almost magical) at hiding memory access issues if you have enough compute to keep the GPU busy. But more often than not, memory is the limiting factor, and that’s when you do indeed have to start answering questions like yours.