Linear interpolation with integer texture.

“Linear texture filtering may be done only for textures that are configured to return floating-point data.”

This statement is unclear. It can be interpreted in two ways.

  1. It’s fine if the texture memory is integer, so long as you access it using cudaReadModeNormalizedFloat.

  2. Everything, including the texture memory, must be float.

So which is it?

I assumed that 1) was the correct interpretation. I get valid reads for positions that exactly match a texel position, and random garbage otherwise (the values fluctuate when I read in two different kernels). I assume this happens because the hardware actually can’t do bilinear interpolation for integer data. Is that correct?


Below is a worked example of bilinear interpolation in a 2D texture, where texels are of an integer type. One thing to keep in mind is that the hardware interpolation has very low resolution, using a 1.8 fixed-point format internally (See section F.2. Linear Filtering of the CUDA C Programming Guide). So the granularity of the interpolation is going to be rather coarse and depending on your use case you may be better off doing your own interpolation. The output of the code below should look like this:

reading array straight
    11.00000      12.00000      13.00000
    21.00000      22.00000      23.00000
    31.00000      32.00000      33.00000
   253.00000     254.00000     255.00000
reading array shifted in x-direction
    12.00000      13.00000      13.00000
    22.00000      23.00000      23.00000
    32.00000      33.00000      33.00000
   254.00000     255.00000     255.00000
reading array shifted in y-direction
    11.00000      12.00000      13.00000
    16.00000      17.00000      18.00000
    26.00000      27.00000      28.00000
   142.00000     143.00000     144.00000

Here is the source code:

#include <stdlib.h>
#include <stdio.h>

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)
// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaThreadSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

texture<unsigned short, 2, cudaReadModeNormalizedFloat> tex;

__global__ void kernel (int m, int n, float shift_x, float shift_y) 
    float val;
    for (int row = 0; row < m; row++) {
        for (int col = 0; col < n; col++) {
            val = 65535.0 * tex2D (tex, col+0.5f+shift_x, row+0.5f+shift_y);
            printf ("%12.5f  ", val);
        printf ("\n");

int main (void)
    int m = 4; // height = #rows
    int n = 3; // width  = #columns
    size_t pitch, tex_ofs;
    unsigned short arr[4][3]= {{11,12,13},{21,22,23},{31,32,33},{253,254,255}};
    unsigned short *arr_d = 0;

    CUDA_SAFE_CALL(cudaMemcpy2D(arr_d, pitch, arr, n*sizeof(arr[0][0]),
    tex.normalized = false;
    tex.filterMode = cudaFilterModeLinear;
    CUDA_SAFE_CALL (cudaBindTexture2D (&tex_ofs, &tex, arr_d, &tex.channelDesc,
                                       n, m, pitch));
    if (tex_ofs !=0) {
        printf ("tex_ofs = %zu\n", tex_ofs);
        return EXIT_FAILURE;
    printf ("reading array straight\n");
    kernel<<<1,1>>>(m, n, 0.0f, 0.0f);
    CUDA_SAFE_CALL (cudaDeviceSynchronize());
    printf ("reading array shifted in x-direction\n");
    kernel<<<1,1>>>(m, n, 1.0f, 0.0f);
    CUDA_SAFE_CALL (cudaDeviceSynchronize());
    printf ("reading array shifted in y-direction\n");
    kernel<<<1,1>>>(m, n, 0.0, -0.5f);
    CUDA_SAFE_CALL (cudaDeviceSynchronize());
    CUDA_SAFE_CALL (cudaFree (arr_d));
    return EXIT_SUCCESS;

Thanks for the example. The error was a bad case of PEBKAC.

#if PIX_TYPE == 1
float mult = 255.0f;
float mult = 65535.0f;

That should have been PIX_SIZE. Multiplying by 65535.0 instead of 255.0 rollovers to the correct value, unless there was an interpolation going on, in which case it yields garbage.

The example given is nice but incomplete. A horizontal shift of +0.5 gives fractional answers but using the 2 byte integer types returns values rounded to the nearest integer. These are LARGE errors of 0.5. Using unsigned char textures returns errors of about 0.0019 and float textures return exact values.

texture results.txt (862 Bytes)

the code is here (3.2 KB)

This means 16-bit integer data cannot be used for accurate work with textures but both 8-bit integer and 32-bit float can be. As far as I know this is an undocumented “feature”.

This seems consistent with the fact that hardware interpolation uses 1.8 bit fixed-point arithmetic as documented in the CUDA Programming Guide. An experiment I would recommend to get a better feel for the quantization effect is to interpolate in tiny increments between 0 and 1 and observe the output change in the form of a step function reflecting the 8-bit quantization.

For use cases for which the hardware texture interpolation does not provide sufficient resolution, I recommend performing the interpolation manually using single-precision FMAs. The performance of modern GPU hardware is typically high enough using that approach to be sufficient even for ambitious use cases (e.g. high-resolution CT scans with volumes of 1024 cubed).

njuffa, I respectfully beg to differ. It is true that interpolation coefficients are rounded to the nearest multiple if 1/256 but all other arithmetic is not. This means on grids of equally spaced 257 points in [0,1] in up to 3 dimensions potentially exact results can be obtained. In fact for small integers float32 does give exact results, uchar has errors of 1/512 or less but int16 has errors up to 0.5. I think this is unexpected behaviour and should at least be documented by NVIDIA.
I think it is a common misconception that CUDA textures always inaccurate and it is as pity they are not more used in “serious” applications. I am a physicist who as worked in medical imaging for some time, you correctly mention CT, but 3D image registration in MRI or mixed modality imaging is another important use case and is computationally demanding. Texture fetches are about 5 time faster that simple kernel code on recent GPUs and both are bandwidth limited so that 8-bit fetches are 4 times faster that 32-bit fetches. My recent book “Programming in Parallel with CUDA”, CUP, doi:10.1017/9781108855273 has a chapter on textures which is rare for books of this kind.
The code I posted is designed to do the sort of experiment you suggest so maybe you could use it to verify that sometimes textures are very accurate.

I think we both share a love of CUDA - best Richard Ansorge.

I do not claim to be infallible. You could always file a bug report with NVIDIA and see what they have to say regarding your observations.

From what I recall from assisting someone working on CT scans up to 1024 cubed is that the version using manual interpolation based on fmaf() ran at roughly one third the speed of the texture-based version, with significantly enhanced accuracy. With today’s hardware (Tesla A100, Quadro A6000) one should be able to target problems of size 2048 cubed with the same approach.