I’m using 2D textures in CUDA with the tex2D() command. Now when I look into the compiled SASS-code in Nsight Compute, this command ist sometimes translated into “TEX.SCR.LL” and sometimes into “TEX.B.LL” (the Visual Profiler shows TEXS.T/TEXS.P and TEX.B.T/TEX.B.P). The performance of the two drastically differs in my case, and I’ve seen up to a 10x slowdown in case TEX.B.LL is used. Unfortunately I can’t seem to find an exlicit documentation of these commands anywhere. The PTX documentation also doesn’t hint on the distinction of these commands.
My questions are:
-What do the commands TEX.SCR.LL and TEX.B.LL explicitly mean, what’s the difference?
-Why is there such a performance difference in some cases?
-How can I control in the C++ code, which of the low-level commands are eventually used?
An optional texture sampler b may be specified. If no sampler is specified, the sampler behavior is a property of the named texture.
I think in current architectures it is effectively the same SASS instruction (with or without B).
I do not think that the SCR variant is the opposite of B.
(Perhaps rather related to tex2Dgather or something similar, where you can fetch the raw data? Just guessing. Perhaps the inverse transform between texture and screen coordinates?)
It seems to me that the “texture sampler b” is just one of the variables a, b, c, … and not necessarily connected to the TEX.B command. The meaning of that “B” remains unclear.
I rechecked, and the CUDA command tex2D<float>(texture, x, y)
(where texture is a cudaTextureObject with pitched linear memory) compiles to the PTX command tex.2d.v4.f32.f32 {%f208, %f209, %f210, %f211}, [%rd5, {%f48, %f37}];
which contains no b-parameter. Nevertheless, on a cc 6.1 device this results in the SASS code TEX.B.T R7, R6, R28, 0x0, 2D, 0x1 ;
as shown by the Visual Profiler (where the register numbers don’t necessarily correspond to each other). Compiling to cc 8.6 with pitched memory or compiling to cc 6.1 with cudaArray on the other hand results in TEX.SCR.LL and TEXS.T instructions, respectively, which both don’t suffer from the slowdown.
Are you sure? The texture instructions between 7.0 and 8.6 should be quite the same, perhaps except the introduction of the uniform datapath with uniform registers with 7.5.
Not so sure actually, as I can’t see the SASS code for cc 7.x . Only judging from some performance drop that I see there as well. So you’re probably right.
I don’t quite understand. Do you mean that I first compile to PTX code, then put the generated PTX commands as inline assembly in the C++ code, and then compile again for 7.5 to check if it still results in TEX.B instructions?
Exactly. Perhaps there is just a wrong heuristics between C++ and PTX. One of your original questions was, how to control, which low-level commands are issued.
Okay, I will try that. However, since the above tex.2d.v4.f32.f32 command still compiled to TEX.B instructions in some cases, I fear that this may not be enough control. Let’s see.
__global__ void kernel(cudaTextureObject_t *texp, cudaTextureObject_t tex, float *out, int N, int Nsum)
{
int xindex = (blockIdx.x * blockDim.x) + threadIdx.x;
int yindex = (blockIdx.y * blockDim.y) + threadIdx.y;
float sum = 0.f;
for (int i = 0; i < Nsum; i++)
{
__syncthreads();
float x = 2e-7f * xindex;
cudaTextureObject_t t1 = texp[i];
cudaTextureObject_t t2 = tex;
float texval0 = tex2D<float>(t1, x, 1.1f);
float texval1 = tex2D<float>(t2, x, 2.1f);
float texval2 = tex2DLayered<float>(t1, x, 3.1f, i);
float texval3 = tex2DLayered<float>(t2, x, 4.1f, i);
sum += texval0 + texval1 + texval2 + texval3;
}
*out = sum;
}
From this I see that it compiles to:
cc6.1:
-pitched memory + texture from value: TEXS.P
-pitched memory + texture from pointer: TEX.B.T, TEX.B.P
-array memory + texture from value: TEXS.T, TEXS.P
-array memory + texture from pointer: TEX.B.T, TEX.B.P
cc7.5:
-pitched memory + texture from value: TEX.SCR.LL
-pitched memory + texture from pointer: TEX.SCR.B.LL
-array memory + texture from value: TEX.SCR.LL
-array memory + texture from pointer: TEX.B.LL
cc8.6:
-pitched memory + texture from value: TEX.SCR.LL
-pitched memory + texture from pointer: TEX.SCR.B.LL
-array memory + texture from value: TEX.SCR.LL
-array memory + texture from pointer: TEX.B.LL
=> So we learn that 7.5 and 8.6 compile to the same commands, ending with .LL. The B comes in when the texture object was retreived via a pointer. But while the TEX.SCR.B.LL performs well, the TEX.B.T/P is extremely slow. How could I prevent this, when I still want to use pitched memory and only get the texture objects through pointers?
perhaps you can also consider alternatives, e.g. using a single 3D array with one of the coordinates (exact coordinate to avoid interpolation) used as index to a 2D texture instead of separate pointers to 2D textures.
Or you could try to provide a texture sampler (= independent mode) and see, if it gets faster with arrays.
From what I observed so far, the fast TEXS instructions (=Texture fetch with scalar/non-vec4 source/destinations) in cc 6.1 are seemingly only used when the texture objects are placed in constant memory as kernel parameters. Not when they are in registers. The constant memory, however, can hold only a limited number of around 8000 texture objects, which does not suffice me.
I can try a 3D array, but that would likely also mean a waste of interpolation operations and memory fetches.