Repeated 1D interpolation with type promotion

lukas.glandorf · July 14, 2020, 2:46pm

Dear community,

my problem consists of of multiple data sequences with the same size that my measurement hardware returns as "unsigned short"s. The sequences are are stored sequentially as a long 1D array. Each sequence has to be resampled by interpolation for which the new coordinates (floats) are equal for all sequences and are stored in another 1D array of the same size as one sequence. Further data processing is done on the GPU after the resampling.

I have already written a working interpolation kernel, however, the interpolation remains the main bottleneck. Because this is carried out on an older GPU (Geforce 750) I thought it would be worth a try and check if texture based interpolation performs faster.
If I convert the data from unsigned short to float type on the GPU and move this to the GPU for texture interpolation, everything works fine. I have attached a minimal working example for this below.

#include <iostream>
#include <cuda.h>
#include <cstdlib>

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)
// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaThreadSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// attempt to interpolate linear memory
__global__
void cuda_texture_interpolate(cudaTextureObject_t tex,
                              int count,
                              int size,
                              float* d_map) {
    const int start_idx = (blockIdx.x * size + threadIdx.x)*(count);
    if (start_idx < size) {
        if (count < 1) { count = 1; }
        float x;
        float y;
        printf("count is: %d and start is %d an size is %d\n", count, start_idx, size);
        for (int i = 0; i != count; i++) {
            x = (d_map[i]+0.5+start_idx);
            y = tex1D<float>(tex, x);
            printf("ID: %d, x: %f ; y: %f\n", threadIdx.x, x, y);
        }
    }
}


int main() {
    const int n = 10;
    const int map_length = 5;
    float a_host[n] = {11,14,17,13,15,11,14,17,13,15};

    float m_host[map_length] = {0, 0.4288, 0.8615, 1.2980, 1.7384};

    // allocate and copy to cuda array and device memory
    cudaChannelFormatDesc channelDesc =
            cudaCreateChannelDesc(32, 0, 0, 0,
                                  cudaChannelFormatKindFloat);
    cudaArray* cuArray;
    float* d_map;
    CUDA_SAFE_CALL(cudaMalloc((void**)&d_map, sizeof(float)*map_length));
    CUDA_SAFE_CALL(cudaMallocArray(&cuArray, &channelDesc, n));

    // Copy to device memory
    CUDA_SAFE_CALL(cudaMemcpyToArray(cuArray, 0, 0, a_host, n*sizeof(float),
                      cudaMemcpyHostToDevice));
    CUDA_SAFE_CALL(cudaMemcpy(d_map,m_host, sizeof(float)*map_length, cudaMemcpyHostToDevice));

    // create texture object
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = cuArray;

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.addressMode[0]   = cudaAddressModeClamp;
    texDesc.filterMode       = cudaFilterModeLinear;
    texDesc.readMode         = cudaReadModeElementType;
//    texDesc.normalizedCoords = 1;
    texDesc.normalizedCoords = 0;


    cudaResourceViewDesc resViewDesc;
    memset(&resViewDesc, 0, sizeof(resViewDesc));
    resViewDesc.format = cudaResViewFormatFloat1;
    resViewDesc.width = n;

    // create texture object
    cudaTextureObject_t tex;
    CUDA_SAFE_CALL(cudaCreateTextureObject(&tex, &resDesc, &texDesc, &resViewDesc));

    // call interpolation kernel
    cuda_texture_interpolate<<<2,2>>>(tex,map_length,n, d_map);
    CHECK_LAUNCH_ERROR();
    // clean up
    CUDA_SAFE_CALL(cudaDestroyTextureObject(tex));
    CUDA_SAFE_CALL(cudaFreeArray(cuArray));
    CUDA_SAFE_CALL(cudaFree(d_map));

    printf("end of texture_object_interpolation.\n");
    }

I thought it would be really need to utilize the texture interpolation to also take care of the type conversion from unsigned short to float (as for example done here. However, for both my code, as well as when I change the copied and successfully tested 2D example implementation to a 1D case, the code does not compile due to an “invalid argument” error when I try to create the texture object. Does anyone have any idea how to fix this or am I missing an obvious limitation that prohibits my preferred use case?

Here is the unsuccessful code:


// attempt to interpolate linear memory
__global__
void cuda_texture_interpolate(cudaTextureObject_t tex,
                              int count,
                              int size,
                              float* d_map) {
    const int start_idx = (blockIdx.x * size + threadIdx.x)*(count);
    if (start_idx < size) {
        if (count < 1) { count = 1; }
        float x;
        float y;
        printf("count is: %d and start is %d an size is %d\n", count, start_idx, size);
        for (int i = 0; i != count; i++) {
            x = (d_map[i]+0.5+start_idx);
            y = tex1D<float>(tex, x)*65536.0f;
            printf("ID: %d, x: %f ; y: %f\n", threadIdx.x, x, y);
        }
    }
}


int main() {
    const int n = 10;
    const int map_length = 5;
    unsigned short a_host[n] = {11,14,17,13,15,11,14,17,13,15};

    float m_host[map_length] = {0, 0.4288, 0.8615, 1.2980, 1.7384};

    // allocate and copy to cuda array and device memory
    cudaChannelFormatDesc channelDesc =
            cudaCreateChannelDesc(16, 0, 0, 0,
                                  cudaChannelFormatKindUnsigned);
    cudaArray* cuArray;
    float* d_map;
    CUDA_SAFE_CALL(cudaMalloc((void**)&d_map, sizeof(float)*map_length));
    CUDA_SAFE_CALL(cudaMallocArray(&cuArray, &channelDesc, n));

    // Copy to device memory
    CUDA_SAFE_CALL(cudaMemcpyToArray(cuArray, 0, 0, a_host, n*sizeof(unsigned short),
                      cudaMemcpyHostToDevice));
    CUDA_SAFE_CALL(cudaMemcpy(d_map,m_host, sizeof(float)*map_length, cudaMemcpyHostToDevice));

    // create texture object
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = cuArray;

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.addressMode[0]   = cudaAddressModeClamp;
    texDesc.filterMode       = cudaFilterModeLinear;
    texDesc.readMode         = cudaReadModeNormalizedFloat;
//    texDesc.normalizedCoords = 1;
    texDesc.normalizedCoords = 0;


    cudaResourceViewDesc resViewDesc;
    memset(&resViewDesc, 0, sizeof(resViewDesc));
    resViewDesc.format = cudaResViewFormatFloat1;
    resViewDesc.width = n;

    // create texture object
    cudaTextureObject_t tex;
    CUDA_SAFE_CALL(cudaCreateTextureObject(&tex, &resDesc, &texDesc, &resViewDesc));

    // call interpolation kernel
    cuda_texture_interpolate<<<2,2>>>(tex,map_length,n, d_map);
    CHECK_LAUNCH_ERROR();
    // clean up
    CUDA_SAFE_CALL(cudaDestroyTextureObject(tex));
    CUDA_SAFE_CALL(cudaFreeArray(cuArray));
    CUDA_SAFE_CALL(cudaFree(d_map));

    printf("end of texture_object_interpolation.\n");
}

I suspect this error to be caused by a wrong combination when I try to promote the type. Thanks in advance!

njuffa · July 14, 2020, 5:25pm

When accessing a non-float texture for interpolation one needs to specify cudaReadModeNormalizedFloat, not cudaReadModeElementType, since the input to the interpolation process needs to be floating-point. However, even using that I cannot get tex1D() to do what’s desired. It interpolates with float textures only, best I can tell.

I have a vague recollection of having a run-in with frustrating tex1D limitations before, and have not used it ever since (a dozen years ago!), only tex1Dfetch() and tex2D().

I would suggest using tex2D() with a 2D texture of height 1 instead.

Note that it is typically not necessary to dive into the detailed texture controls when using texture references, specifying channel descriptors and the like. Below is the initial code I used to play with tex1D() interpolation. Useful for a quick round of tests, but texture objects are of course less of a kludge than texture references and preferred these days. I would recommend neither for plain old compute work though, and haven’t used textures since Kepler times.

I assume you are aware that GPU hardware interpolation uses 8-bit quantization, which produces unacceptable artifacts in many use cases outside graphics. The details of the fixed-point computation used for hardware texture interpolation are documented in an appendix to the CUDA Programming Guide.

#include <stdio.h>
#include <stdlib.h>

#define N 5

texture<float, 1, cudaReadModeElementType> tex;

__global__ void kernel (float shift)
{
    for (int i = 0; i < N; i++) {
        printf ("%15.8e\n", tex1D (tex, i + 0.5f + shift));
    }
}

int main (void) 
{
    float data[N] = {1, 2, 3, 4, 5};
    cudaArray* data_d = 0; 
    cudaMallocArray (&data_d, &tex.channelDesc, N, 1);
    cudaMemcpyToArray (data_d, 0, 0, data, sizeof(float) * N, cudaMemcpyHostToDevice);
    cudaBindTextureToArray (tex, data_d);
    tex.normalized = false;
    tex.filterMode = cudaFilterModeLinear;
    printf ("reading array straight:\n");
    kernel<<<1,1>>>(0.0f);
    cudaDeviceSynchronize();
    printf ("reading array shifted by -0.3:\n");
    kernel<<<1,1>>>(-0.3f);
    cudaDeviceSynchronize();
    printf ("reading array shifted by +0.3:\n");
    kernel<<<1,1>>>(0.3f);
    cudaDeviceSynchronize();
    cudaUnbindTexture (tex);
    cudaFree (data_d);
    return EXIT_SUCCESS;
}

lukas.glandorf · July 23, 2020, 2:13pm

Thanks njuffa! It seems like the ReadMode variable error slipped by me when I played around with it and I would have probably butted my head a lot more with the tex1D interpolation…

I implemented your suggestion with tex2D and height of 1, nice workaround! I was aware of the 8-bit quantization. While that would be a problem for the actual post-processing, the goal of my current implementation is to visualize the3d image data while it is being recorded. For my data, the 8-bit quantization pretty much interpolates to the closest integer value when compared to interpolation with better accuracy which is fine. I also get a decent speed increase of approx. 30% for small arrays due to smaller memCpy, and combined type promotion+interpolation.

However, it seems I didn’t pay enough attention to the maximum size of the cudaArray to which I bind the texture and am running into it much to early for my application. The biggest I can go is to allocate an array of 8x8x2048 unsigned shorts and the goal is to get to 256x256x2048 which would be approx. 256MB.

Nevertheless, thanks for the help again!

Topic		Replies	Views
Using tex2D for unsigned short/char CUDA Programming and Performance	14	3621	November 15, 2017
Linear interpolation with integer texture. CUDA Programming and Performance	6	2681	August 12, 2022
Output of 2D texture memory is zero CUDA Programming and Performance	9	968	March 30, 2021
Bilinear texture interpolation of unsigned char array always returns 1.0 CUDA Programming and Performance	1	867	April 10, 2016
tex3D<float>... not able to to calculate from a code CUDA Programming and Performance	0	913	November 28, 2018
Some questions about texture memory CUDA Programming and Performance	8	1508	March 5, 2019
Using texture memory over iterations causes incorrect read/write of some lines CUDA Programming and Performance cuda	2	510	September 2, 2020
Understanding CUDA texture 2D linear interpolation CUDA Programming and Performance	4	2610	May 10, 2022
new: cubic interpolation in CUDA cubic B-spline interpolation CUDA Programming and Performance	19	43074	July 20, 2023
Accuracy of 1D linear interpolation by CUDA texture interpolation CUDA Programming and Performance	25	14587	January 29, 2013

Repeated 1D interpolation with type promotion

Related topics