Question about Global memory and Texture memory

Hi all,

I just made a simple kernel which is:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;

dst[width * y + x] = src[width * (y - 1) + (x + 1)] + 2 * src[width * y + (x + 1)] + src[width * (y + 1) + (x + 1)];

dst and src are global memory buffers.

However, when I profile this kernel, I see texture transactions. If I change the dst write statement to:

dst[width * y + x] = x + y;

the texture transactions disappear.

I also tried dst[width * y + x] = src[width * y + x]; and texture transactions show up there as well.

My questions are:

Does this mean the texture path is used automatically?
Is this compiler behavior?
Could you point me to the documentation for this feature?
How can I control this myself?

Susan

What GPU are you using? Kepler and later architectures provide a machine instruction LDG that performs global memory reads through the texture path. In order to do so, the memory object in question must be read-only for the duration of the kernel execution. Under the right circumstances, the compiler can prove that this condition is met, and will use LDG automatically. Programmers can also use the __ldg() intrinsic to force use of LDG. Violating the whole-kernel read-only constraint in such a case will result in undefined behavior, so use with care.
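
For illustration, the explicit form would look something like this (just a sketch; the kernel name and float element type are assumptions on my part, not taken from your code):

__global__ void copy_ldg(float *dst, const float *src, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // __ldg() reads src through the read-only (texture) data path.
    // src must not be written by any thread while this kernel is running.
    dst[width * y + x] = __ldg(&src[width * y + x]);
}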

You can easily check the generated machine code with cuobjdump --dump-sass and search for instances of the LDG instruction.

In order to maximize the chances that the compiler finds safe uses of LDG, careful use of the const and __restrict__ qualifiers on pointer arguments is recommended, in particular the combination const __restrict__. This topic should be covered in the Best Practices Guide and possibly the Programming Guide as well.
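
For example, the stencil from the original post could be written along these lines (an untested sketch; I am assuming float elements):

__global__ void stencil(float * __restrict__ dst, const float * __restrict__ src, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // With src qualified const __restrict__, the compiler can prove the loads
    // come from read-only, non-aliased memory and may emit LDG on its own.
    // (Boundary handling is omitted here, as in the original post.)
    dst[width * y + x] = src[width * (y - 1) + (x + 1)]
                       + 2 * src[width * y + (x + 1)]
                       + src[width * (y + 1) + (x + 1)];
}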

Yes, I am using GPUs of the Kepler generation and later.

Thank you very much for the info. It is really helpful.

Is there a way to disable it? Even if the compiler can prove it is safe and beneficial, I would still like to run some comparisons of my own at times.

I’m not sure if this conversion/optimization happens before or after generation of PTX. You could try compiling for a previous architecture (-arch=sm_20, or -arch=sm_30) and see if it inhibits this optimization.

According to the docs (Programming Guide :: CUDA Toolkit Documentation), the __ldg() intrinsic requires cc 3.5 or greater.
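
If the same code also needs to build for earlier architectures, one common pattern is to guard the intrinsic on __CUDA_ARCH__ (a sketch; src and idx are placeholder names):

#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 350
    float v = __ldg(&src[idx]);   // cc 3.5 and up: read via the read-only data path
#else
    float v = src[idx];           // earlier architectures: plain global load
#endif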

I am not aware of a compiler flag that disables the generation of LDG. It may be achievable in an indirect fashion by turning off compiler optimizations, but (a) I haven’t verified this and (b) it has obviously far-reaching implications for code performance. I am not sure whether coding at PTX level can reliably eliminate the generation of LDG, since I am not sure which part of the compiler makes the substitution.

Thank you both.