Using tex2D for unsigned short/char

Hi,

I’m looking for the fastest way to interpolate an integer-valued (uint8/16) image.
The interpolation should be bilinear and the resulting array should be encoded in single floats (32 bits).

I tried using textures and tex2D, but I had to cast everything into float2; I didn't manage to do it directly from uint8/16.

I saw that I could use cudaArrays, but the content of my texture is going to change every frame (at a really high frame rate). Is it still a good solution?

I am totally new to the use of textures, I’ll be grateful for any suggestions.

That works for sure.

You have to set the filter mode and read mode to the proper values.

See any tutorial on the CUDA texture functions, or the example at https://stackoverflow.com/questions/17075617/setting-up-a-cuda-2d-unsigned-char-texture-for-linear-interpolation

Btw: I would use texture objects instead of global texture references.

There is a caveat when using integer textures, though (or at least there was the last time I tried it): it won't produce non-integer values. For some reason it seems to do the interpolation before the promotion to float.

As explained in section G.2 of the CUDA Programming Guide, all hardware texture interpolation happens using a fixed-point format:

Note the 8 bits of fraction, which match the 8 bits of the unsigned integer data, explaining your observation. Where higher-quality interpolation is required, use single-precision FMAs; the GPU provides copious throughput for those ("too cheap to meter").

I am aware of the fixed-point format limitation that you mention, but the way I understand it, that just means the sampling coordinate is effectively rounded to 1/256th of a texel (and this happens regardless of whether the texture is float or int).

The issue I saw was specific to integer textures (I think I was using 16-bit). The result was that if you had a texture with two texels containing the values 0 and 1, then regardless of where you sampled it, you could only ever get one of two possible values. My recollection is that if the texture contained the values 0 and 65535, you could get intermediate values.

Having said that, the quoted Stack Overflow example doesn't, on the face of it, appear to have the same problem, so it is perhaps something that has been fixed since I tested it (or perhaps it's something that behaves differently with 8-bit vs. 16-bit integer textures).

Interestingly, the CUDA Programming Guide also says

G.2. Linear Filtering
In this filtering mode, which is only available for floating-point textures, the value returned by the texture fetch is

which I don't think I've ever noticed before. It's a little ambiguous whether this means the textures really have to be stored as floating point or whether they can be integer textures read as float.

I think you will find that your observations are entirely consistent with the fact that texture interpolation is done with fixed-point computation that is performed with eight bits of fraction.

The cited Stack Overflow example was posted by me, and it is also not at odds with the 8-bit resolution. Note that I limited the output to two decimal places, which provides a resolution of 1/100, whereas the 8 bits of fraction in the hardware interpolation provide a resolution of 1/256. The limited granularity would become apparent if data with more digits were used.

The fact remains that hardware texture interpolation is a fast computation with low resolution, which is not suitable for all use cases. To the best of my knowledge and observations, the computational details of GPU texture interpolation have not changed since CUDA first shipped in late 2007, so I think it is reasonable to assume that they won't change anytime soon.

Modern GPUs provide such high single-precision FMA throughput compared to available memory bandwidth that I would seriously suggest programmers look into that alternative to hardware texture interpolation.

I agree with you about FMA being a viable alternative to hardware texture interpolation. If you look here

https://www5.cs.fau.de/research/former-projects/rabbitct/ranking/?ps=1024

then you can compare the performance and accuracy of my “Elmer (high accuracy)” implementation which uses texture gather and FMA and my original “Elmer” implementation which uses hardware texture interpolation.

I’m probably being really stupid, so please bear with me while I try to understand this.

I have modified your example slightly so it now reads:

#include <stdlib.h>
#include <stdio.h>

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)
// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

texture<unsigned char, 2, cudaReadModeNormalizedFloat> tex;

__global__ void kernel (int m, int n, float shift_x, float shift_y) 
{
    float val;
    for (int row = 0; row < m; row++) {
        for (int col = 0; col < n; col++) {
            val = tex2D (tex, col+0.5f+shift_x, row+0.5f+shift_y);
            printf ("%.3f ", val*255.0f);
        }
        printf ("\n");
    }
}

int main (void)
{
    int m = 4; // height = #rows
    int n = 3; // width  = #columns
    size_t pitch, tex_ofs;
    unsigned char arr[4][3]= {{11,12,13},{21,22,23},{31,32,33},{251,252,253}};
    unsigned char *arr_d = 0;

    CUDA_SAFE_CALL(cudaMallocPitch((void**)&arr_d,&pitch,n*sizeof(*arr_d),m));
    CUDA_SAFE_CALL(cudaMemcpy2D(arr_d, pitch, arr, n*sizeof(arr[0][0]),
                                n*sizeof(arr[0][0]),m,cudaMemcpyHostToDevice));
    tex.normalized = false;
    tex.filterMode = cudaFilterModeLinear;
    CUDA_SAFE_CALL (cudaBindTexture2D (&tex_ofs, &tex, arr_d, &tex.channelDesc,
                                       n, m, pitch));
    if (tex_ofs !=0) {
        printf ("tex_ofs = %zu\n", tex_ofs);
        return EXIT_FAILURE;
    }
    printf ("reading array straight\n");
    kernel<<<1,1>>>(m, n, 0.0f, 0.0f);
    CHECK_LAUNCH_ERROR();
    CUDA_SAFE_CALL (cudaDeviceSynchronize());
    printf ("reading array shifted in x-direction\n");
    kernel<<<1,1>>>(m, n, 0.5f, 0.0f);
    CHECK_LAUNCH_ERROR();
    CUDA_SAFE_CALL (cudaDeviceSynchronize());
    printf ("reading array shifted in y-direction\n");
    kernel<<<1,1>>>(m, n, 0.0f, 0.5f);
    CHECK_LAUNCH_ERROR();
    CUDA_SAFE_CALL (cudaDeviceSynchronize());
    CUDA_SAFE_CALL (cudaFree (arr_d));
    return EXIT_SUCCESS;
}

and outputs:

reading array straight
11.000 12.000 13.000
21.000 22.000 23.000
31.000 32.000 33.000
251.000 252.000 253.000
reading array shifted in x-direction
11.502 12.502 13.000
21.502 22.502 23.000
31.502 32.502 33.000
251.502 252.502 253.000
reading array shifted in y-direction
16.000 17.000 18.000
26.000 27.000 28.000
141.000 142.000 143.000
251.000 252.000 253.000

This is as I would expect.

But if I change it from unsigned char to unsigned short (and change the 255.0f to 65535.0f in the kernel) then it outputs:

reading array straight
11.000 12.000 13.000
21.000 22.000 23.000
31.000 32.000 33.000
251.000 252.000 253.000
reading array shifted in x-direction
12.000 13.000 13.000
22.000 23.000 23.000
32.000 33.000 33.000
252.000 253.000 253.000
reading array shifted in y-direction
16.000 17.000 18.000
26.000 27.000 28.000
141.000 142.000 143.000
251.000 252.000 253.000

This is the behaviour I observed previously.

So, for the unsigned char case, taking the top middle cell with the shift in the x-direction, it appears to be doing something like:
(12 * 127 + 13 * 128) / 255 = 12.502

This is already slightly strange, because the wording in the manual suggests (to me at least) that the divisor should be 256 rather than 255, and that a shift of 0.5 should therefore be exact, but it's a minor point.

What I don’t understand is what it is doing when I change it to unsigned short. Why is it not still:
(12 * 127 + 13 * 128) / 255 = 12.502
?

Having added even more decimal places, it seems that it is actually (for unsigned char) doing:

(12 * 128 + 13 * 129) / 257 = 12.501945

I have passing familiarity with the Rabbit CT benchmark as I have had discussions about it with one of the top contestants (who uses FMAs for accurate interpolation).

It has been a long time since I dove into the exact details of GPU texture interpolation. I think your analysis may be thrown somewhat by the details of conversions from and to the internal fixed-point representation. I seem to recall that the manner of these conversions is prescribed in detail by graphics standards such as OpenGL, but my memory is very hazy after a dozen years.

But it should be clear, even with the small deviations observed in your examples, that the issues with limited granularity are generally consistent with intermediate processing at 8-bit resolution.

I had a quick look in the OpenGL documentation, but all I could find was a note that the weights could be fixed-point, with the precision being implementation-specific.

However, this NVIDIA patent is interesting:

http://www.google.tl/patents/US20060028482

It does look like floating-point textures get converted to fixed point (with a common exponent) and then go through the fixed-point interpolation pipeline, rather than the other way around. If we then assume that the fixed-point interpolation pipeline is only wide enough to produce 16 bits of mantissa output, then I guess that more or less fits all of the observations.

Drawing on my hazy memory: how the various types supported by OpenGL(-ES) should be converted into each other is described in a separate part of the standard, not in the texturing section. Or maybe it is even discussed in bits and pieces across various parts of the spec (I found the organization of the OpenGL spec somewhat confusing).

The important part here is that the interpolation factors in the GPU interpolation can only take one of 257 different values in [0.0, 1.0], so no matter what the type of the output data, the output will be limited to at most 257 different discrete results across the full interpolation range.

Looking at relevant patents can often clarify implementation details not spelled out in official documentation; however (in my layman's understanding), at least in American jurisdictions, it can substantially increase the risk of "willful infringement" of someone's IP, so as a general rule I don't do it.

Besides the minor discrepancy I noted, I think that explains the results for unsigned char: 256 possible input values, and if two adjacent texels differ by 1, you can get 256 different values between them corresponding to different interpolation factors. So, basically, a total of 65536 possible different results.

But for unsigned short I still don't quite get it. We now have 65536 possible input values, and if two adjacent texels differ by 1, we should be able to get 256 different values between them, for a total of 16777216 possible different results; yet we still only see 65536. That's why I say there must be a 16-bit limitation somewhere. My guess is that it's specific to NVIDIA and that I won't find any mention of it in the OpenGL documentation.

I’ve been advised in the past that not looking at relevant patents can constitute a failure to perform due diligence - I guess you can’t win.

I am not a lawyer. In my layman’s understanding, looking at relevant patents for “due diligence” is a task for a lawyer, not an engineer (unless the latter is instructed by the former to do so).

From what I understand, as a creator of IP, knowing less about what other IP creators are doing reduces risk should infringement ever be alleged.

For example, from what I have read, copyright cases in the music industry have found that “subconscious copying” does not protect from a charge of copyright infringement. I am not aware of cases where this has been applied to software, but it seems reasonable to assume that having familiarity with a particular piece of code would likely be seen as making subconscious copying more likely. For that reason I do not study code under GPL (the viral nature of GPL could have disastrous consequences if subconscious copying were found to have occurred).