Texture commands in SASS code

b45h · October 21, 2024, 8:36am

I’m using 2D textures in CUDA with the tex2D() command. When I look at the compiled SASS code in Nsight Compute, this command is sometimes translated into “TEX.SCR.LL” and sometimes into “TEX.B.LL” (the Visual Profiler shows TEXS.T/TEXS.P and TEX.B.T/TEX.B.P). The performance of the two differs drastically in my case, and I’ve seen up to a 10x slowdown when TEX.B.LL is used. Unfortunately I can’t seem to find explicit documentation of these commands anywhere. The PTX documentation also doesn’t hint at the distinction between these commands.

My questions are:
-What do the commands TEX.SCR.LL and TEX.B.LL explicitly mean, what’s the difference?
-Why is there such a performance difference in some cases?
-How can I control in the C++ code, which of the low-level commands are eventually used?

Curefab · October 21, 2024, 10:33am

The texture commands have variants with b parameter and variants without:

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#texture-instructions-tex

An optional texture sampler b may be specified. If no sampler is specified, the sampler behavior is a property of the named texture.

I think in current architectures it is effectively the same SASS instruction (with or without B).

I do not think that the SCR variant is the opposite of B.
(Perhaps rather related to tex2Dgather or something similar, where you can fetch the raw data? Just guessing. Perhaps the inverse transform between texture and screen coordinates?)

b45h · October 22, 2024, 8:08am

It seems to me that the “texture sampler b” is just one of the variables a, b, c, … and not necessarily connected to the TEX.B command. The meaning of that “B” remains unclear.

Curefab · October 22, 2024, 9:10am

Yes, it is one of the parameters a, b, c, …

tex.geom.v4.dtype.ctype  d, [a, c] {, e} {, f};
tex.geom.v4.dtype.ctype  d[|p], [a, b, c] {, e} {, f};  // explicit sampler

If you compare those two lines, the instructions are available in a form with b and without b.

That is what makes b special compared to a, c, …

I think for your case it is more important what SCR is or does.

Can you provide a very short example of similar code compiling to SCR or B variants?

b45h · October 22, 2024, 12:23pm

I rechecked, and the CUDA command
tex2D<float>(texture, x, y)
(where texture is a cudaTextureObject with pitched linear memory) compiles to the PTX command
tex.2d.v4.f32.f32 {%f208, %f209, %f210, %f211}, [%rd5, {%f48, %f37}];
which contains no b-parameter. Nevertheless, on a cc 6.1 device this results in the SASS code
TEX.B.T R7, R6, R28, 0x0, 2D, 0x1 ;
as shown by the Visual Profiler (where the register numbers don’t necessarily correspond to each other). Compiling to cc 8.6 with pitched memory or compiling to cc 6.1 with cudaArray on the other hand results in TEX.SCR.LL and TEXS.T instructions, respectively, which both don’t suffer from the slowdown.
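For reference, the two setups compared here could be created roughly as follows. This is only a sketch: the helper names makeTextureDesc, makePitchedTexture and makeArrayTexture are hypothetical, the filter and address modes are assumptions (they are not stated above), and error checking is left out.

```cuda
#include <cuda_runtime.h>

// Assumed sampling settings shared by both variants (not taken from the thread).
static cudaTextureDesc makeTextureDesc()
{
    cudaTextureDesc td = {};
    td.addressMode[0]   = cudaAddressModeClamp;
    td.addressMode[1]   = cudaAddressModeClamp;
    td.filterMode       = cudaFilterModeLinear;
    td.readMode         = cudaReadModeElementType;
    td.normalizedCoords = 0;
    return td;
}

// Variant 1: texture object backed by pitched linear memory
// (devPtr and pitchInBytes as returned by cudaMallocPitch).
cudaTextureObject_t makePitchedTexture(float *devPtr, size_t pitchInBytes,
                                       size_t width, size_t height)
{
    cudaResourceDesc res = {};
    res.resType                  = cudaResourceTypePitch2D;
    res.res.pitch2D.devPtr       = devPtr;
    res.res.pitch2D.desc         = cudaCreateChannelDesc<float>();
    res.res.pitch2D.width        = width;   // in elements
    res.res.pitch2D.height       = height;  // in rows
    res.res.pitch2D.pitchInBytes = pitchInBytes;

    cudaTextureDesc td = makeTextureDesc();
    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);
    return tex;
}

// Variant 2: texture object backed by a cudaArray
// (array as returned by cudaMallocArray).
cudaTextureObject_t makeArrayTexture(cudaArray_t array)
{
    cudaResourceDesc res = {};
    res.resType         = cudaResourceTypeArray;
    res.res.array.array = array;

    cudaTextureDesc td = makeTextureDesc();
    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);
    return tex;
}
```

In both cases the kernel side is the same tex2D<float>(texture, x, y) call, so any difference in the generated SASS comes from the target architecture and from how the object reaches the kernel, not from the fetch call itself.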

Curefab · October 22, 2024, 12:37pm

So the slow operation with TEX.B.T appears if you use pitched linear memory on a cc 6.1 device?

cc 8.6 is always fine and using cudaArrays on cc 6.1 is also fine?

b45h · October 22, 2024, 12:43pm

Yes, it appears so. Also, cc 5.x and 7.x behave like 6.1 in this case. Only from cc 8.x on it becomes fine.

Curefab · October 22, 2024, 12:54pm

TEX.B.T R7, R6, R28, 0x0, 2D, 0x1 ;

If I understand correctly:

R7: destination [output]
R6+R7: coordinates
R28+R29: texture sampler for each coordinate
0x0: texture id
2D: texture type
0x1: components mask (just 1 component, 32-bit output)

Curefab · October 22, 2024, 1:01pm

Are you sure? The texture instructions between 7.0 and 8.6 should be largely the same, except perhaps for the introduction of the uniform datapath with uniform registers in 7.5.

Curefab · October 22, 2024, 1:06pm

Can you compile pitched linear memory with 8.6, see the PTX and put it in a volatile asm block for 7.5?

b45h · October 22, 2024, 1:23pm

Not so sure actually, as I can’t see the SASS code for cc 7.x. I’m only judging from a performance drop that I see there as well. So you’re probably right.

b45h · October 22, 2024, 1:30pm

I don’t quite understand. Do you mean that I first compile to PTX code, then put the generated PTX commands as inline assembly in the C++ code, and then compile again for 7.5 to check if it still results in TEX.B instructions?

Curefab · October 22, 2024, 1:42pm

Exactly. Perhaps there is just a wrong heuristic between C++ and PTX. One of your original questions was how to control which low-level commands are issued.
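A minimal sketch of that experiment, assuming the PTX line quoted earlier in the thread (tex.2d.v4.f32.f32 without the b operand). The names tex2DViaPtx and probeKernel are hypothetical, and whether this changes the SASS that ptxas emits is exactly what the test is meant to find out.

```cuda
#include <cuda_runtime.h>

// Issues the texture fetch as inline PTX instead of going through tex2D<float>().
// cudaTextureObject_t is a 64-bit handle, hence the "l" constraint.
__device__ __forceinline__ float tex2DViaPtx(cudaTextureObject_t tex, float x, float y)
{
    float r, g, b, a;
    asm volatile("tex.2d.v4.f32.f32 {%0, %1, %2, %3}, [%4, {%5, %6}];"
                 : "=f"(r), "=f"(g), "=f"(b), "=f"(a)
                 : "l"(tex), "f"(x), "f"(y));
    (void)g; (void)b; (void)a;  // only the first component is used for a float texture
    return r;
}

// Small kernel to inspect in the SASS view: the texture object is loaded
// through a pointer, which is the case that produced TEX.B above.
__global__ void probeKernel(const cudaTextureObject_t *texp, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = tex2DViaPtx(texp[0], 0.5f + i, 1.1f);
}
```

Compiling this once per target architecture and comparing the SASS against the tex2D<float>() version would show whether the C++ front end or the PTX-to-SASS step picks the instruction variant.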

b45h · October 22, 2024, 1:47pm

Okay, I will try that. However, since the above tex.2d.v4.f32.f32 command still compiled to TEX.B instructions in some cases, I fear that this may not be enough control. Let’s see.

Curefab · October 22, 2024, 2:03pm

You can look at the SASS (and PTX) code e.g. with Godbolt:

b45h · October 22, 2024, 4:15pm

Oh thanks a lot! Here is a sample code:

__global__ void kernel(cudaTextureObject_t *texp, cudaTextureObject_t tex, float *out, int N, int Nsum)
{
    int xindex = (blockIdx.x * blockDim.x) + threadIdx.x;
    int yindex = (blockIdx.y * blockDim.y) + threadIdx.y;

    float sum = 0.f;
    for (int i = 0; i < Nsum; i++)
    {
        __syncthreads();
        float x = 2e-7f * xindex;
        
        cudaTextureObject_t t1 = texp[i];
        cudaTextureObject_t t2 = tex;
        
        float texval0 = tex2D<float>(t1, x, 1.1f);
        float texval1 = tex2D<float>(t2, x, 2.1f);
        float texval2 = tex2DLayered<float>(t1, x, 3.1f, i);
        float texval3 = tex2DLayered<float>(t2, x, 4.1f, i);
        sum += texval0 + texval1 + texval2 + texval3;
    }
    *out = sum;
}

From this I see that it compiles to:

cc6.1:
-pitched memory + texture from value: TEXS.P
-pitched memory + texture from pointer: TEX.B.T, TEX.B.P
-array memory + texture from value: TEXS.T, TEXS.P
-array memory + texture from pointer: TEX.B.T, TEX.B.P

cc7.5:
-pitched memory + texture from value: TEX.SCR.LL
-pitched memory + texture from pointer: TEX.SCR.B.LL
-array memory + texture from value: TEX.SCR.LL
-array memory + texture from pointer: TEX.B.LL

cc8.6:
-pitched memory + texture from value: TEX.SCR.LL
-pitched memory + texture from pointer: TEX.SCR.B.LL
-array memory + texture from value: TEX.SCR.LL
-array memory + texture from pointer: TEX.B.LL

=> So we learn that 7.5 and 8.6 compile to the same commands, ending with .LL. The B comes in when the texture object was retrieved via a pointer. But while TEX.SCR.B.LL performs well, TEX.B.T/P is extremely slow. How can I prevent this if I still want to use pitched memory and can only get the texture objects through pointers?

Curefab · October 22, 2024, 5:36pm

Hi b45h,

perhaps you can also consider alternatives, e.g. using a single 3D array with one of the coordinates (exact coordinate to avoid interpolation) used as index to a 2D texture instead of separate pointers to 2D textures.

Or you could try to provide a texture sampler (= independent mode) and see if it gets faster with arrays.
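The 3D-array idea could be sketched like this. The names makeStackTexture and fetchSlice are hypothetical, linear filtering with unnormalized coordinates is an assumption, and error checking as well as the cudaMemcpy3D upload of the slices are left out.

```cuda
#include <cuda_runtime.h>

// Allocates one 3D array holding `slices` 2D images and wraps it in a
// single texture object, so the kernel needs only one handle.
cudaTextureObject_t makeStackTexture(cudaArray_t *arrayOut,
                                     size_t width, size_t height, size_t slices)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaMalloc3DArray(arrayOut, &desc, make_cudaExtent(width, height, slices));

    cudaResourceDesc res = {};
    res.resType         = cudaResourceTypeArray;
    res.res.array.array = *arrayOut;

    cudaTextureDesc td = {};
    td.addressMode[0]   = cudaAddressModeClamp;
    td.addressMode[1]   = cudaAddressModeClamp;
    td.addressMode[2]   = cudaAddressModeClamp;
    td.filterMode       = cudaFilterModeLinear;   // assumed; filters in x, y and z
    td.readMode         = cudaReadModeElementType;
    td.normalizedCoords = 0;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);
    return tex;
}

// With unnormalized coordinates, texel centers sit at integer + 0.5, so
// sampling z at slice + 0.5 gives that slice full weight and the
// neighbouring slices zero weight.
__device__ float fetchSlice(cudaTextureObject_t stack, float x, float y, int slice)
{
    return tex3D<float>(stack, x, y, slice + 0.5f);
}
```

The slice index then takes over the role of the per-texture pointer lookup, at the cost of requiring all images to share one size and format.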

Curefab · October 22, 2024, 6:00pm

P/T was some phase signifier for texture access (had to be done in two phases) - prefetch?

LL is about level of detail mode. Choosing a LOD level could also be an option instead of using a pointer.

b45h · October 23, 2024, 2:58pm

From what I have observed so far, the fast TEXS instructions (= texture fetch with scalar/non-vec4 source/destinations) on cc 6.1 are seemingly only used when the texture objects are placed in constant memory as kernel parameters, not when they are in registers. The constant memory, however, can hold only a limited number of around 8000 texture objects, which is not enough for me.

I can try a 3D array, but that would likely also mean a waste of interpolation operations and memory fetches.

Curefab · October 23, 2024, 3:05pm

Cubemaps are 6-layered textures. Mipmaps can have more layers. Or 3D textures.

How many texture objects do you have?
IIRC with the Cuda driver interface one can access a bit more (10x?) of constant memory.

Can you fuse together several textures horizontally/laterally instead?
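A rough sketch of the horizontal fusing, assuming equally sized tiles laid side by side in one wide texture, unnormalized coordinates and linear filtering. The name fetchFromAtlas and its parameters are hypothetical.

```cuda
#include <cuda_runtime.h>

// Fetches from tile number `tile` of a texture in which several images of
// width `tileWidth` (in texels) are stored next to each other along x.
__device__ float fetchFromAtlas(cudaTextureObject_t atlas, int tile,
                                float tileWidth, float x, float y)
{
    // Keep x between the first and last texel centers of the tile so that
    // linear filtering never blends in texels of the neighbouring tile.
    float xc = fminf(fmaxf(x, 0.5f), tileWidth - 0.5f);
    return tex2D<float>(atlas, tile * tileWidth + xc, y);
}
```

The number of tiles that fit this way is bounded by the maximum 2D texture width of the device, so this only reduces the number of texture objects rather than removing the limit.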
