Bilinear texture interpolation of unsigned char array always returns 1.0

I have a grayscale image stored as a flat array of type unsigned char in host memory (later I will be interested in color images). I would like to put it in texture memory on the device and be able to access it at arbitrary (x,y) coordinates while taking advantage of the hardware interpolation. I would like to be able to, between kernel calls, asynchronously copy new data into the device memory which is bound to the texture.

I understand that the result will be normalized to the 0…1 interval, and that is fine.

I found some code on Stack Overflow that addresses my exact question: http://stackoverflow.com/questions/17075617/setting-up-a-cuda-2d-unsigned-char-texture-for-linear-interpolation

It works, and I am even able to update the device memory asynchronously. But when I try to integrate this into my larger program, all calls to tex2D erroneously return 1.0. I'm really at a loss as to what I am doing differently from the working example.

My code is broken into a main C file and a CU file.

The main function has:

unsigned char * d_img;
size_t pitch;
initImage2(d_img, WIDTH, HEIGHT, &pitch);

Where initImage2 is:

void initImage2(unsigned char * d_img, int width, int height, size_t * pitch) {
    cudaMallocPitch((void**)&d_img, pitch, width*sizeof(*d_img), height);
    image.normalized = false;
    image.filterMode = cudaFilterModeLinear;
    size_t tex_ofs;
    // cudaBindTexture2D (&tex_ofs, &image, d_img, &image.channelDesc, width, height, *pitch); // I've verified that tex_ofs is being set to 0; that's not the error.
    cudaBindTexture2D (0, &image, d_img, &image.channelDesc, width, height, *pitch);
}

The C code later tries to launch the kernel asynchronously:

latch( img1g.data, image, d_img, pitch, h_K1, d_D1, &numKP1, maxKP, d_K, d_I, &keypoints1, WIDTH, HEIGHT, latchFinished );

Which is implemented as:

void latch( unsigned char* img,
            cudaArray* imageArray,
            unsigned char * d_img2,
            size_t pitch,
            int* h_K,
            unsigned int* d_D,
            int* keypoints,
            int maxKP,
            int* d_K,
            unsigned char* d_img,
            vector<KeyPoint>* vectorKP,
            const int imgWidth,
            const int imgHeight,
            cudaEvent_t latchFinished) {
    // All of these calls are non blocking but serialized.
    cudaMemsetAsync(d_K, -1, maxKP * sizeof(int) * 2); // Negative one is represented by all '1' bits in both int32 and 8-bit uchar.
    cudaMemsetAsync(d_D,  0, maxKP * sizeof(int));
    size_t sizeImg = imgWidth * imgHeight * sizeof(unsigned char);
    cudaMemcpyAsync(d_img, img, sizeImg, cudaMemcpyHostToDevice);

    // cudaDeviceSynchronize();
    cudaMemcpy2DAsync(d_img2, pitch, img, imgWidth*sizeof(unsigned char), imgWidth*sizeof(unsigned char), imgHeight, cudaMemcpyHostToDevice);
    // cudaDeviceSynchronize();

    // Only prep up to maxKP for the GPU (as that is the most we have prepared the GPU to handle)
    *keypoints = ((*vectorKP).size() < maxKP) ? (*vectorKP).size() : maxKP;
    for (int i=0; i<*keypoints; i+=1) {
        h_K[2*i  ] = (*vectorKP)[i].pt.x;
        h_K[2*i+1] = (*vectorKP)[i].pt.y;
    }
    for (int i=*keypoints; i<maxKP; i++) {
        h_K[2*i  ] = -1;
        h_K[2*i+1] = -1;
    }

    size_t sizeK = *keypoints * sizeof(int) * 2;
    cudaMemcpyAsync(d_K, h_K, sizeK, cudaMemcpyHostToDevice);

    dim3 threadsPerBlock(_warpSize, warpsPerBlock);
    dim3 blocksPerGrid(*keypoints, 1, 1);
    latch<<<blocksPerGrid, threadsPerBlock>>>(d_img, d_K, d_D, imgWidth, imgHeight);
    cudaEventRecord(latchFinished);
}

The first argument, img1g.data, is where the host data actually lives.

Furthermore, at file scope in the CU file I have:

texture<unsigned char, 2, cudaReadModeNormalizedFloat> image;

Inside the kernel, I pick a single thread to print a small patch of the texture:

if (blockIdx.x == 0 && threadIdx.x == 0 && threadIdx.y == 0) {
    printf(":: %f \n", (float) g_img[(200)*imgWidth + (200)]);
    register float val;
    for (int row = 199; row < 202; row++) {
        for (int col = 199; col < 202; col++) {
            val = (float) tex2D(image, col+0.5f, row+0.5f);
            printf("%f ", val);
        }
        printf("\n");
    }
}

But the output looks like:

:: 138.000000 
1.000000 1.000000 1.000000 
1.000000 1.000000 1.000000 
1.000000 1.000000 1.000000

I am not sure why. I would expect the middle value to be 138/255 ≈ 0.54 rather than 1.0. I woke up in the middle of the night and worked on it for a while, and went to bed immediately after getting results that looked right (I did not verify they were right, but the patches I was printing were non-constant). I have not been able to reproduce that behavior today. What am I missing?

It's hard to debug third-party code from mere snippets. Generally, requests for debugging assistance require posting complete, buildable, and runnable code.

I suspect the tex2D lookups are behaving in unexpected ways due to earlier errors that go undetected. The first thing you would want to do is add proper checking of the status returns of all CUDA API calls and all kernel launches. Also, try running the app under control of cuda-memcheck, which can catch a variety of issues (the exact extent of available checks may differ slightly with GPU architecture).
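For example, a minimal error-checking sketch (a common CUDA idiom; the CUDA_CHECK name is my own, not something from your code) could look like this:

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Wrap every runtime API call:
CUDA_CHECK(cudaMallocPitch((void**)&d_img, pitch, width*sizeof(*d_img), height));
CUDA_CHECK(cudaBindTexture2D(0, &image, d_img, &image.channelDesc, width, height, *pitch));

// Kernel launches return no status, so query it explicitly afterwards:
latch<<<blocksPerGrid, threadsPerBlock>>>(d_img, d_K, d_D, imgWidth, imgHeight);
CUDA_CHECK(cudaGetLastError());        // catches launch/configuration errors
CUDA_CHECK(cudaDeviceSynchronize());   // catches asynchronous execution errors

For cuda-memcheck, no code changes are needed; run the binary as cuda-memcheck ./your_app (compiling with -lineinfo makes its reports point at source lines).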

I am not exactly sure how you do your asynchronous copies, but obviously bad things can happen if you update the memory underlying a bound texture while that texture is being worked on by a kernel.
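As one sketch of how to serialize the update (reusing the latchFinished event your latch function already records; whether this is the right place depends on your pipeline):

// Make the host wait until the kernel that reads the texture has finished
// before overwriting the pitched allocation the texture is bound to.
cudaEventSynchronize(latchFinished);

// Only now is it safe to stream the next frame into d_img2.
cudaMemcpy2DAsync(d_img2, pitch, img, imgWidth*sizeof(unsigned char),
                  imgWidth*sizeof(unsigned char), imgHeight,
                  cudaMemcpyHostToDevice);

Alternatively, issuing the copy and the kernel into the same non-default stream lets the runtime serialize them for you without blocking the host.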