Using tex2D for unsigned short/char

Hi,

I’m looking for the fastest way to interpolate an integer-valued (uint8/16) image.
The interpolation should be bilinear and the resulting array should be encoded in single floats (32 bits).

I tried using textures and tex2D, but I had to cast everything into float2; I didn't manage to do it directly from uint8/16.

I saw that I could use cudaArrays, but the content of my texture is going to change every frame (at a really high frame rate). Is it still a good solution?

I am totally new to the use of textures, I’ll be grateful for any suggestions.

That works for sure.

You have to set the filter mode and read mode to the proper values.

See any tutorial on the CUDA texture functions, or the example at https://stackoverflow.com/questions/17075617/setting-up-a-cuda-2d-unsigned-char-texture-for-linear-interpolation

Btw: I would use texture objects instead of global texture references.

There is a caveat when using integer textures, though (or at least there was the last time I tried it): it won't produce non-integer values. For some reason it seems to do the interpolation before the promotion to float.

As explained in section G.2 of the CUDA Programming Guide, all hardware texture interpolation happens using a fixed-point format:

Note the 8 bits of fraction, which match the 8 bits of the unsigned integer data, explaining your observation. Where higher-quality interpolation is required, use single-precision FMAs; the GPU provides copious throughput for those ("too cheap to meter").

I am aware of the fixed-point format limitation that you mention, but the way I understand it, that just means the sampling coordinate is effectively rounded to 1/256th of a texel (and this happens regardless of whether the texture is float or int).

The issue I saw was specific to integer textures (I think I was using 16-bit). The result was that if you had a texture with two texels containing the values 0 and 1, then regardless of where you sampled it, you could only ever get one of two possible values. My recollection is that if the texture contained the values 0 and 65535, you could get intermediate values.

Having said that, the quoted Stack Overflow example doesn't, on the face of it, appear to have the same problem, so it is perhaps something that has been fixed since I tested it (or perhaps it's something that behaves differently with 8-bit vs. 16-bit integer textures).

Interestingly, the CUDA Programming Guide also says

G.2. Linear Filtering
In this filtering mode, which is only available for floating-point textures, the value returned by the texture fetch is

which I don't think I've ever noticed before. It's a little ambiguous whether this means the textures really have to be stored as floating point or whether they can be integer textures read as float.

I think you will find that your observations are entirely consistent with the fact that texture interpolation is done with fixed-point computation that is performed with eight bits of fraction.

The cited Stack Overflow example was posted by me, and it is also not at odds with the 8-bit resolution. Note that I limited the output to two decimal places, which provides a resolution of 1/100, whereas the 8 bits of fraction in the hardware interpolation provide a resolution of 1/256. The limited granularity would become apparent if data with more digits were used.

The fact remains that hardware texture interpolation is a fast computation with low resolution, which is not suitable for all use cases. To the best of my knowledge and observations, the computational details of GPU texture interpolation have not changed since CUDA first shipped in late 2007, so I think it is reasonable to assume that they won't change anytime soon.

Modern GPUs provide such high single-precision FMA throughput compared to available memory bandwidth that I would seriously suggest programmers look into that alternative to hardware texture interpolation.

I agree with you about FMA being a viable alternative to hardware texture interpolation. If you look here

https://www5.cs.fau.de/research/former-projects/rabbitct/ranking/?ps=1024

then you can compare the performance and accuracy of my “Elmer (high accuracy)” implementation which uses texture gather and FMA and my original “Elmer” implementation which uses hardware texture interpolation.

I’m probably being really stupid, so please bear with me while I try to understand this.

I have modified your example slightly so it now reads:

#include <stdlib.h>
#include <stdio.h>

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)
// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

texture<unsigned char, 2, cudaReadModeNormalizedFloat> tex;

__global__ void kernel (int m, int n, float shift_x, float shift_y) 
{
    float val;
    for (int row = 0; row < m; row++) {
        for (int col = 0; col < n; col++) {
            val = tex2D (tex, col+0.5f+shift_x, row+0.5f+shift_y);
            printf ("%.3f ", val*255.0f);
        }
        printf ("\n");
    }
}

int main (void)
{
    int m = 4; // height = #rows
    int n = 3; // width  = #columns
    size_t pitch, tex_ofs;
    unsigned char arr[4][3]= {{11,12,13},{21,22,23},{31,32,33},{251,252,253}};
    unsigned char *arr_d = 0;

    CUDA_SAFE_CALL(cudaMallocPitch((void**)&arr_d,&pitch,n*sizeof(*arr_d),m));
    CUDA_SAFE_CALL(cudaMemcpy2D(arr_d, pitch, arr, n*sizeof(arr[0][0]),
                                n*sizeof(arr[0][0]),m,cudaMemcpyHostToDevice));
    tex.normalized = false;
    tex.filterMode = cudaFilterModeLinear;
    CUDA_SAFE_CALL (cudaBindTexture2D (&tex_ofs, &tex, arr_d, &tex.channelDesc,
                                       n, m, pitch));
    if (tex_ofs !=0) {
        printf ("tex_ofs = %zu\n", tex_ofs);
        return EXIT_FAILURE;
    }
    printf ("reading array straight\n");
    kernel<<<1,1>>>(m, n, 0.0f, 0.0f);
    CHECK_LAUNCH_ERROR();
    CUDA_SAFE_CALL (cudaDeviceSynchronize());
    printf ("reading array shifted in x-direction\n");
    kernel<<<1,1>>>(m, n, 0.5f, 0.0f);
    CHECK_LAUNCH_ERROR();
    CUDA_SAFE_CALL (cudaDeviceSynchronize());
    printf ("reading array shifted in y-direction\n");
    kernel<<<1,1>>>(m, n, 0.0f, 0.5f);
    CHECK_LAUNCH_ERROR();
    CUDA_SAFE_CALL (cudaDeviceSynchronize());
    CUDA_SAFE_CALL (cudaFree (arr_d));
    return EXIT_SUCCESS;
}

and outputs:

reading array straight
11.000 12.000 13.000
21.000 22.000 23.000
31.000 32.000 33.000
251.000 252.000 253.000
reading array shifted in x-direction
11.502 12.502 13.000
21.502 22.502 23.000
31.502 32.502 33.000
251.502 252.502 253.000
reading array shifted in y-direction
16.000 17.000 18.000
26.000 27.000 28.000
141.000 142.000 143.000
251.000 252.000 253.000

This is as I would expect.

But if I change it from unsigned char to unsigned short (and change the 255.0f to 65535.0f in the kernel) then it outputs:

reading array straight
11.000 12.000 13.000
21.000 22.000 23.000
31.000 32.000 33.000
251.000 252.000 253.000
reading array shifted in x-direction
12.000 13.000 13.000
22.000 23.000 23.000
32.000 33.000 33.000
252.000 253.000 253.000
reading array shifted in y-direction
16.000 17.000 18.000
26.000 27.000 28.000
141.000 142.000 143.000
251.000 252.000 253.000

This is the behaviour I observed previously.

So, for the unsigned char case, taking the top middle cell with the shift in the x-direction, it appears to be doing something like:
(12 * 127 + 13 * 128) / 255 = 12.502

This is already slightly strange, because the wording in the manual suggests (to me at least) that the divisor should be 256 rather than 255, and that a shift of 0.5 should therefore be exact, but it's a minor point.

What I don’t understand is what it is doing when I change it to unsigned short. Why is it not still:
(12 * 127 + 13 * 128) / 255 = 12.502
?

Having added even more decimal places, it seems that it is actually (for unsigned char) doing:

(12 * 128 + 13 * 129) / 257 = 12.501945

I have passing familiarity with the Rabbit CT benchmark as I have had discussions about it with one of the top contestants (who uses FMAs for accurate interpolation).

It has been a long time since I dove into the exact details of GPU texture interpolation. I think your analysis may be thrown somewhat by the details of conversions from and to the internal fixed-point representation. I seem to recall that the manner of these conversions is prescribed in detail by graphics standards such as OpenGL, but my memory is very hazy after a dozen years.

But it should be clear, even with the small deviations observed in your examples, that the issues with limited granularity are generally consistent with intermediate processing at 8-bit resolution.

I had a quick look in the OpenGL documentation, but all I could find was a note that the weights could be fixed-point, with the precision being implementation-specific.

However, this NVIDIA patent is interesting:

http://www.google.tl/patents/US20060028482

It does look like floating-point textures get converted to fixed point (with a common exponent) and then go through the fixed-point interpolation pipeline, rather than the other way around. If we then assume that the fixed-point interpolation pipeline is only wide enough to produce 16 bits of mantissa output, then I guess that more or less fits all of the observations.

Drawing on my hazy memory: how the various types supported by OpenGL(-ES) should be converted into each other is described in a separate part of the standard, not in the texturing section. Or maybe it is even discussed in bits and pieces across various parts of the spec (I found the organization of the OpenGL spec somewhat confusing).

The important part here is that the interpolation factors in the GPU interpolation can only take one of 257 different values in [0.0, 1.0], so no matter what the type of the output data, the output will be limited to at most 257 different discrete results across the full interpolation range.

Looking at relevant patents can often clarify implementation details not spelled out in official documentation; however (in my layman's understanding), at least in American jurisdictions, it can substantially increase the risk of "willful infringement" of someone's IP, so as a general rule I don't do it.

Besides the minor discrepancy I noted, I think that explains the results for unsigned char: 256 possible input values, and if two adjacent texels differ by 1, you can get 256 different values between them corresponding to different interpolation factors. So, basically, a total of 65536 possible different results.

But for unsigned short I still don't quite get it. We now have 65536 possible input values, and if two adjacent texels differ by 1, we should be able to get 256 different values between them, for a total of 16777216 possible different results; yet we still only see 65536. That's why I say there must be a 16-bit limitation somewhere. My guess is that it's specific to NVIDIA and that I won't find any mention of it in the OpenGL documentation.

I’ve been advised in the past that not looking at relevant patents can constitute a failure to perform due diligence - I guess you can’t win.

I am not a lawyer. In my layman’s understanding, looking at relevant patents for “due diligence” is a task for a lawyer, not an engineer (unless the latter is instructed by the former to do so).

From what I understand, as a creator of IP, knowing less about what other IP creators are doing reduces risk should infringement ever be alleged.

For example, from what I have read, copyright cases in the music industry have found that “subconscious copying” does not protect from a charge of copyright infringement. I am not aware of cases where this has been applied to software, but it seems reasonable to assume that having familiarity with a particular piece of code would likely be seen as making subconscious copying more likely. For that reason I do not study code under GPL (the viral nature of GPL could have disastrous consequences if subconscious copying were found to have occurred).