Bilinear texture interpolation of unsigned char array always returns 1.0

csp256 · April 10, 2016, 8:23pm

I have a grayscale image stored as a flat array of type unsigned char in host memory (later I will be interested in color images). I would like to put it in texture memory on the device and be able to access it at arbitrary (x,y) coordinates while taking advantage of the hardware interpolation. I would like to be able to, between kernel calls, asynchronously copy new data into the device memory which is bound to the texture.

I understand that the result will be normalized to the interval 0…1 interval, and that is fine.

I found some code on stack overflow for my exact question. http://stackoverflow.com/questions/17075617/setting-up-a-cuda-2d-unsigned-char-texture-for-linear-interpolation

It works and I am even able to update the device memory asynchronously. When I try to integrate this into my larger program all calls to tex2D erroneously return 1. I’m really at a loss as to what I am doing that is different from the working example.

My code is broken into a main C file and a CU file.

The main function has:

unsigned char * d_img;
    size_t pitch;
    initImage2(d_img, WIDTH, HEIGHT, &pitch);

Where initImage2 is:

void initImage2(unsigned char * d_img, int width, int height, size_t * pitch) {
    cudaMallocPitch((void**)&d_img, pitch, width*sizeof(*d_img), height);
    image.normalized = false;
    image.filterMode = cudaFilterModeLinear;
    size_t tex_ofs;
    // cudaBindTexture2D (&tex_ofs, &image, d_img, &image.channelDesc, width, height, *pitch); // I've verified that tex_ofs is being set to 0, thats not the error. 
    cudaBindTexture2D (0, &image, d_img, &image.channelDesc, width, height, *pitch);
}

The C code later tries to launch the kernel asynchronously:

latch( img1g.data, image, d_img, pitch, h_K1, d_D1, &numKP1, maxKP, d_K, d_I, &keypoints1, WIDTH, HEIGHT, latchFinished );

Which is implemented as:

void latch( unsigned char* img,
            cudaArray* imageArray,
            unsigned char * d_img2,
            size_t pitch,
            int* h_K,
            unsigned int* d_D,
            int* keypoints,
            int maxKP,
            int* d_K,
            unsigned char* d_img,
            vector<KeyPoint>* vectorKP,
            const int imgWidth,
            const int imgHeight,
            cudaEvent_t latchFinished) {
    // All of these calls are non blocking but serialized.
    cudaMemsetAsync(d_K, -1, maxKP * sizeof(int) * 2); // Negative one is represented by all '1' bits in both int32 and uchar8.
    cudaMemsetAsync(d_D,  0, maxKP * sizeof(int));
    size_t sizeImg = imgWidth * imgHeight * sizeof(unsigned char);
    cudaMemcpyAsync(d_img, img, sizeImg, cudaMemcpyHostToDevice);

    // cudaDeviceSynchronize();
    cudaMemcpy2DAsync(d_img2, pitch, img, imgWidth*sizeof(unsigned char), imgWidth*sizeof(unsigned char), imgHeight, cudaMemcpyHostToDevice);
    // cudaDeviceSynchronize();

    // Only prep up to maxKP for the GPU (as that is the most we have prepared the GPU to handle)
    *keypoints = ((*vectorKP).size() < maxKP) ? (*vectorKP).size() : maxKP;
    for (int i=0; i<*keypoints; i+=1) {
        h_K[2*i  ] = (*vectorKP)[i].pt.x;
        h_K[2*i+1] = (*vectorKP)[i].pt.y;
    }
    for (int i=*keypoints; i<maxKP; i++) {
        h_K[2*i  ] = -1;
        h_K[2*i+1] = -1;
    }

    size_t sizeK = *keypoints * sizeof(int) * 2;
    cudaMemcpyAsync(d_K, h_K, sizeK, cudaMemcpyHostToDevice);

    dim3 threadsPerBlock(_warpSize, warpsPerBlock);
    dim3 blocksPerGrid(*keypoints, 1, 1);
    latch<<<blocksPerGrid, threadsPerBlock>>>(d_img, d_K, d_D, imgWidth, imgHeight);
    cudaEventRecord(latchFinished);
}

The first argument is where the host data actually lives.

Furthermore, I have at file scope in the CU file:

texture<unsigned char, 2, cudaReadModeNormalizedFloat> image;

Inside the kernel I elect a block to print a small chunk of the texture out:

if (blockIdx.x == 0 && threadIdx.x == 0 && threadIdx.y == 0) {
            printf(":: %f \n", (float) g_img[(200)*imgWidth + (200)]);
            register float val;
            for (int row = 199; row < 202; row++) {
                for (int col = 199; col < 202; col++) {
                    val = (float) tex2D(image, col+0.5f, row+0.5f);
                    printf("%f ", val);
                }
                printf("\n");
            }
        }

But the output looks like:

:: 138.000000 
1.000000 1.000000 1.000000 
1.000000 1.000000 1.000000 
1.000000 1.000000 1.000000

I am not sure why. I would expect the middle 1.0 value to be 138/255 = ~0.58. I woke up in the middle of last night and worked on it for a while and went to bed immediately after getting results that looked right (I did not very they were right, but the patches I was outputting were non-constant). I have not been able to reproduce that behavior today. What am I missing?

njuffa · April 10, 2016, 8:35pm

It’s hard to debug third party code from mere snippets. Generally, requests for debugging assistance require the posting of complete buildable and runable code.

I suspect the tex2D lookups are behaving in unexpected ways due to earlier errors that go undetected. The first thing you would want to do is add proper checking of the status returns of all CUDA API calls and all kernel launches. Also, try running the app under control of cuda-memcheck, which can catch a variety of issues (the exact extent of available checks may differ slightly with GPU architecture).

I am not exactly sure how you do your asynchronous copies, but obviously bad things can happen if you update the memory underlying a bound texture while that texture is being worked on by a kernel.

Topic		Replies	Views
Repeated 1D interpolation with type promotion CUDA Programming and Performance	3	576	October 12, 2021
Texture Interpolation with 16-bit integers types rounding to nearest value? CUDA Programming and Performance	2	1797	October 20, 2014
Using tex2D for unsigned short/char CUDA Programming and Performance	14	3678	November 15, 2017
opencv imageData copy to cuda CUDA Programming and Performance	6	8595	May 22, 2012
Linear interpolation with textures using unsigned data CUDA Programming and Performance	6	1389	February 20, 2020
Linear interpolation with integer texture. CUDA Programming and Performance	6	2745	August 12, 2022
CUDA texture object with linear memory seems not to be updated when fetching CUDA Programming and Performance cuda	4	260	June 17, 2024
Understanding CUDA texture 2D linear interpolation CUDA Programming and Performance	4	2804	May 10, 2022
texture interpolation CUDA Programming and Performance	9	13045	September 23, 2009
Setting up for tex1D() How to load CUDA Array for tex1D() CUDA Programming and Performance	5	8813	November 20, 2007

Bilinear texture interpolation of unsigned char array always returns 1.0

Related topics