Best option with very few neighbor reads: shared or texture memory?

cudouer · January 12, 2010, 11:10pm

Hello,

I have an n x m matrix where each element is a float4 value. The algorithm processes the matrix first horizontally and then vertically:

- In the first kernel, each thread reads its left and right neighbors, performs some computation, and writes the result to another array.

- In the second kernel, each thread reads its upper and lower neighbors, performs some computation, and writes the result to another array.

This figure shows it graphically:

External Media

The application is computationally intensive. I know that shared memory is a better option than texture memory when there are many neighbor reads, since the larger the radius, the more often each value loaded into shared memory is reused. However, I’m not sure whether this still holds with only two reads per thread, or whether I would get better performance using a texture. Do you recommend texture or shared memory?
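For reference, here is a minimal sketch of what the shared-memory variant of the first (horizontal) kernel might look like. It is not the actual code: `combine`, `horizontalPass`, `BLOCK_W`, and the clamp-to-edge boundary handling are all assumptions made for illustration. With a radius of 1, each block loads its own span plus one halo cell per side, so global reads drop from 2 per thread to roughly 1 + 2/BLOCK_W per thread.

```cuda
#include <cuda_runtime.h>

#define BLOCK_W 128  // threads per block along x (assumed; tune for your GPU)

// Hypothetical stand-in for the real per-element computation.
__device__ float4 combine(float4 a, float4 b)
{
    return make_float4(0.5f * (a.x + b.x), 0.5f * (a.y + b.y),
                       0.5f * (a.z + b.z), 0.5f * (a.w + b.w));
}

// Horizontal pass: out(x, y) = combine(in(x-1, y), in(x+1, y)), edges clamped.
// Launch with block = (BLOCK_W, 1) and grid = (ceil(width / BLOCK_W), height).
__global__ void horizontalPass(const float4* in, float4* out, int width, int height)
{
    __shared__ float4 row[BLOCK_W + 2];  // block's span plus one halo cell per side

    int y = blockIdx.y;
    if (y >= height) return;             // uniform per block, so safe before the barrier

    int x  = blockIdx.x * BLOCK_W + threadIdx.x;
    int xc = min(x, width - 1);          // clamp so out-of-range threads still load valid data
    const float4* src = in + (size_t)y * width;

    row[threadIdx.x + 1] = src[xc];
    if (threadIdx.x == 0)                // left halo cell
        row[0] = src[max(x - 1, 0)];
    if (threadIdx.x == BLOCK_W - 1)      // right halo cell
        row[BLOCK_W + 1] = src[min(x + 1, width - 1)];
    __syncthreads();

    if (x < width)
        out[(size_t)y * width + x] = combine(row[threadIdx.x], row[threadIdx.x + 2]);
}
```

The vertical kernel would follow the same pattern with a column-shaped (or 2D) block, where coalescing of the global loads needs more care.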

Thanks.

SPWorley · January 13, 2010, 12:13am

Answer 1: If you are computationally limited, it likely won’t matter much, so do whatever is simpler.

Answer 2: It isn’t clear which is faster, so try both and pick the one with better performance.

Answer 3: A third option may work well: small square tiles, so that you can do both operations in the same kernel, likely with a one-element “apron” of duplicated work around each of the 4 edges. This may help with efficiency in other ways as well, especially with kernel launch overhead and idle SM losses. Scaling may also work better this way, especially if you ever want to go multi-GPU or have more data than device memory, since the data stays in such local patches.
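To make the tile idea concrete, here is a rough sketch of a fused kernel. It rests on several assumptions not stated in the thread: the vertical step consumes the output of the horizontal step, edges are clamped, and `combine` is a hypothetical placeholder for the real computation. `fusedPass` and `TILE` are likewise illustrative names. Each block loads a (TILE + 2) x (TILE + 2) input patch, runs the horizontal step on every row including the two apron rows (that is the duplicated work), and then runs the vertical step on the interior.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // output tile edge (assumed); the block is (TILE + 2) x (TILE + 2) threads

// Hypothetical stand-in for the real per-element computation.
__device__ float4 combine(float4 a, float4 b)
{
    return make_float4(0.5f * (a.x + b.x), 0.5f * (a.y + b.y),
                       0.5f * (a.z + b.z), 0.5f * (a.w + b.w));
}

// Fused pass: horizontal step into shared memory, then vertical step on that result.
// Launch with block = (TILE + 2, TILE + 2), grid = (ceil(width / TILE), ceil(height / TILE)).
__global__ void fusedPass(const float4* in, float4* out, int width, int height)
{
    __shared__ float4 src[TILE + 2][TILE + 2];  // input tile plus one-element apron
    __shared__ float4 mid[TILE + 2][TILE];      // horizontal results, apron rows included

    int tx = threadIdx.x, ty = threadIdx.y;
    // Global coordinates of this thread's input cell (the apron shifts the tile by one).
    int gx = blockIdx.x * TILE + tx - 1;
    int gy = blockIdx.y * TILE + ty - 1;
    int cx = min(max(gx, 0), width - 1);        // clamp to the matrix edges
    int cy = min(max(gy, 0), height - 1);

    src[ty][tx] = in[(size_t)cy * width + cx];
    __syncthreads();

    // Horizontal step for every row of the patch, apron rows too.
    bool innerX = (tx >= 1 && tx <= TILE);
    if (innerX)
        mid[ty][tx - 1] = combine(src[ty][tx - 1], src[ty][tx + 1]);
    __syncthreads();

    // Vertical step for interior cells that fall inside the matrix.
    bool innerY = (ty >= 1 && ty <= TILE);
    if (innerX && innerY && gx < width && gy < height)
        out[(size_t)gy * width + gx] = combine(mid[ty - 1][tx - 1], mid[ty + 1][tx - 1]);
}
```

With TILE = 16 this uses a bit under 10 KB of shared memory per block, and the intermediate horizontal result never touches global memory. Whether that outweighs the idle apron threads is something to measure.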

cudouer · January 13, 2010, 5:56pm

Thanks for your comments! So if it isn’t clear, I’ll have to implement both and pick the better one.

SPWorley · January 13, 2010, 6:32pm

There are multiple answers because it really depends on your compute intensity. You say it’s computationally intensive… if so, then the memory access will be hidden and therefore your access method is irrelevant.

The question is how much computation is enough to hide the memory latency and bandwidth cost. As a guess, likely something like 20 FLOPs or more per element. At 5 FLOPs per element it’s clearly memory bound; at 50 FLOPs per element it’s clearly compute bound.
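One rough way to sanity-check such a guess for a particular card is a simple balance estimate. The symbols here are introduced only for illustration: $P$ is the arithmetic throughput the kernel can actually sustain (FLOP/s), $B$ the sustained memory bandwidth (bytes/s), $b$ the bytes moved to and from global memory per element, and $f$ the FLOPs per element. This ignores latency and assumes compute and memory traffic overlap perfectly, so treat it as an order-of-magnitude guide only.

$$
t_{\text{compute}} = \frac{f}{P}, \qquad
t_{\text{memory}} = \frac{b}{B}, \qquad
\text{compute bound when } f > f^{*} = \frac{b\,P}{B}
$$

Note that $b$ depends on what counts as an element and on how well the neighbor reads are served from shared memory or the texture cache: a single float read once and written once gives $b = 8$ bytes, while a whole float4 gives $b = 32$ bytes, so the threshold scales accordingly.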

CUDA is great (almost magical) at hiding memory access issues if you have enough compute to keep the GPU busy. But more often than not, memory is the limiting factor, and that’s when you do indeed have to start answering questions like yours.
