cudaBindTexture2D and offset

CUDA Toolkit Reference Manual 3.2 says about function template cudaBindTexture2D (4.18.2.5):

I have no clue how I should use the alignment with a tex2D call. I think I knew how to combine this with tex1Dfetch for a 1D texture, but there is no tex2Dfetch. I know that I would not run into this problem if data is aligned “properly”, but I would like to handle a case where this is just not the case…

If this is a re-post, I’m sorry. I did a search in the forum and read the post for every promising topic without luck.

Basically what the texture binding does is snap the start of the texture to the next lower texture-aligned address, and returns the difference in bytes between that address and the starting address of the data we want to map. For the 1-D case, this means we need to divide the byte offset by the texel size, and add this to the texture index when accessing the texture. Also, we need to make sure the texture is lengthened by the same number of texels, to make sure the texture covers all the data we want mapped.

The two-dimensional case works analogous. By snapping the start of the texture to the next lower texture-aligned address, we are basically moving the starting column of that texture by (texture offset / texel size) texel. This means we need to add the texel offset to the column portio of our two-dimensional index when accessing the texture. Also, we need to make sure the texture width is increased by the same number of texels, to make sure the texture covers all the data we want mapped.

In the following ASCII-art graphs below (hope they survive) the thinly outlined outer boxes represent texture-aligned chunks of bytes, e.g. 512 bytes on Fermi-class device. The numbers in the corner of these boxes show the address order of these chunks. The boxes with the thinker outlines represent unaligned data portions that we want to bind to a texture.

1D case

=======

//

            <~~~~~~~~~~~ n ~~~~~~~~~~~>

+--------------+--------------+--------------+--------------+

:           +=========================+      :              :

:-- ofs --->|  data to be mapped      |      :              :

: 1         +=========================+    3 :            4 :

+--------------+--------------+--------------+--------------+

            '                         '

index       0                       n-1

//

<~~~~~~~~~~~~~~ n + ofs ~~~~~~~~~~~~~~>                            

+--------------+--------------+--------------+--------------+

+=====================================+      :              :

|             mapped data             |      :              :

+=====================================+    3 :            4 :

+--------------+--------------+--------------+--------------+

'           '                         '

0           ofs                 n+ofs-1

//

//

2D case

=======

//

                            <~~~~~~~~~~~ n ~~~~~~~~~~~>

+--------------+--------------+--------------+--------------+

:              :              :              :              :

:              :              :              :              :

: 0            : 1            : 2            :            3 :

+--------------+--------------+--------------+--------------+   row

:              :           +=========================+      : __0

:              :-- ofs --->|                         |      :

: 4            : 5         |       data              |    7 :

+--------------+-----------|                         |------+

:              :           |                         |      :

:              :           |       to be             |      :

: 8            : 9         |                         |   11 :

+--------------+-----------|                         |------+

:              :           |       mapped            |      :

:              :           |                         |      :

: 12           : 13        +=========================+   15 : __m-1 

+--------------+--------------+--------------+--------------+

                           '                         '

column                     0                       n-1

//

               <~~~~~~~~~~~~~~ n + ofs ~~~~~~~~~~~~~~>            

+--------------+--------------+--------------+--------------+

:              :              :              :              :

:              :              :              :              :

: 0            : 1            : 2            :            3 :

+--------------+--------------+--------------+--------------+   row

:              +=====================================+      : __0 

:              |                                     |      :

: 4            |         mapped                      |    7 :

+--------------+                                     |------+

:              |                                     |      :

:              |         data                        |      :

: 8            |                                     |   11 :

+--------------+                                     |------+

:              |                                     |      :

:              |                                     |      :

: 12           +=====================================+   15 : __m-1

+--------------+--------------+--------------+--------------+

               '           '                         '

column         0           ofs                 n+ofs-1

The following code demonstrates binding of a small 4x3 texture at all possible offsets in a 7x10 matrix. Each matrix element contains its row and column number (encoded as row * 100 + column), and the kernel prints the value of all matrix elements mapped by the texture.

#include <stdio.h>

#include <stdlib.h>

#define ARR_WIDTH   10

#define ARR_HEIGHT  7

#define TEX_WIDTH   3

#define TEX_HEIGHT  4

// Macro to catch CUDA errors in CUDA runtime calls

#define CUDA_SAFE_CALL(call)                                          \

do {                                                                  \

    cudaError_t err = call;                                           \

    if (cudaSuccess != err) {                                         \

        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\

                 __FILE__, __LINE__, cudaGetErrorString(err) );       \

        exit(EXIT_FAILURE);                                           \

    }                                                                 \

} while (0)

texture<float,2> tex;

__global__ void kernel (int colOfs)

{

    int x = threadIdx.x;

    int y = threadIdx.y;

    printf ("Texture value at y:%i x:%i = %03.0f\n", 

            y, x, tex2D (tex , x + colOfs + 0.5 , y + 0.5f));   

}

int main()

{

    dim3 grids(1,1);

    dim3 threads(TEX_WIDTH, TEX_HEIGHT);

    float matrix[ARR_WIDTH * ARR_HEIGHT];

    size_t texOfs = 0;

    size_t pitch = 0;

    float *devPtr = 0;

    int elemPitch;

    int colOfs = 0;

    int posX = 0;

    int posY = 0;

CUDA_SAFE_CALL (cudaMallocPitch ((void**)&devPtr, &pitch, 

                                     ARR_WIDTH*sizeof(matrix[0]), 

                                     ARR_HEIGHT));

    CUDA_SAFE_CALL (cudaMemset (devPtr, 0xff, pitch * ARR_HEIGHT));

    for (int y = 0; y < ARR_HEIGHT ; y++) {

        for (int x = 0; x < ARR_WIDTH ; x++) {

            matrix [y * ARR_WIDTH + x] = y * 100 + x;

        }

    }

    CUDA_SAFE_CALL (cudaMemcpy2D (devPtr,

                                  pitch,

                                  matrix,

                                  ARR_WIDTH*sizeof(matrix[0]),

                                  ARR_WIDTH*sizeof(matrix[0]),

                                  ARR_HEIGHT,

                                  cudaMemcpyHostToDevice));

    elemPitch = pitch / sizeof(matrix[0]);

for (posX = 0; posX <= ARR_WIDTH-TEX_WIDTH; posX++) {

        for (posY = 0; posY <= ARR_HEIGHT-TEX_HEIGHT; posY++) {

            printf ("\ntexture anchored at x=%d,y=%d\n", posX, posY);

            CUDA_SAFE_CALL (cudaBindTexture2D (&texOfs,

                                               &tex,

                                               &devPtr[posY*elemPitch+posX],

                                               &tex.channelDesc,

                                               TEX_WIDTH,

                                               TEX_HEIGHT, 

                                               pitch));

            if (texOfs != 0) {

                printf ("binding misaligned, rebinding a wider texture\n");

                colOfs = (int)(texOfs / sizeof(matrix[0]));

                CUDA_SAFE_CALL (cudaUnbindTexture(tex));

                CUDA_SAFE_CALL (cudaBindTexture2D(&texOfs,

                                                  &tex,

                                                  &devPtr[posY*elemPitch+posX],

                                                  &tex.channelDesc,

                                                  TEX_WIDTH + colOfs,

                                                  TEX_HEIGHT, 

                                                  pitch));

            }

            kernel<<<grids,threads>>>(colOfs);

            CUDA_SAFE_CALL (cudaUnbindTexture (tex));

            CUDA_SAFE_CALL (cudaDeviceSynchronize());

        }

    }

CUDA_SAFE_CALL (cudaFree (devPtr));

    return EXIT_SUCCESS;

}

Thank you very much for the explanation!