CUDA texture object with linear memory seems not to be updated when fetching

Hi,

I’m trying to bind linear memory to a CUDA texture object. My kernel fetches from the memory and writes back to it directly, and I launch the kernel several times. However, I noticed that in each kernel launch the value fetched through the texture stays the same; it is not updated by the write in the previous launch. I wrote a simple test program to demonstrate this:
kernel function:

__global__ void kernelFunction(float* output, int width, int height, int depth, cudaTextureObject_t tex)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;

    if (x < width && y < height && z < depth)
    {
        if(x == 2 && y == 2 && z == 2){
            float value = tex3D<float>(tex, x, y, z);
            printf("prev: %f\n", value);
            output[z * width * height + y * width + x] = value + 1;
        }
    }
}

texture creation:

    int width = 128;
    int height = 128;
    int depth = 128;
    size_t size = width * height * depth * sizeof(float);
    float* inputData = (float*)malloc(size);

    // Allocate device memory for input data
    float* d_inputData;
    cudaMalloc((void**)&d_inputData, size);

    // Set to 0s
    cudaMemset(d_inputData, 0, size);

    // Create texture object
    cudaExtent extent = make_cudaExtent(width, height, depth);
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();

    // Set texture parameters
    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.addressMode[0] = cudaAddressModeWrap;
    texDesc.addressMode[1] = cudaAddressModeWrap;
    texDesc.addressMode[2] = cudaAddressModeWrap;
    texDesc.filterMode = cudaFilterModePoint;
    texDesc.readMode = cudaReadModeElementType;
    texDesc.normalizedCoords = 0;
    resDesc.res.linear.devPtr = d_inputData;
    resDesc.res.linear.desc = channelDesc;
    resDesc.res.linear.sizeInBytes = size;

    // Bind texture to linear memory
    cudaTextureObject_t tex;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

My kernel launch:

    dim3 blockSize(8, 8, 8);
    dim3 gridSize((width + blockSize.x - 1) / blockSize.x, (height + blockSize.y - 1) / blockSize.y, (depth + blockSize.z - 1) / blockSize.z);

    for(int i = 0; i < 5; i++){
        kernelFunction<<<gridSize, blockSize>>>(d_inputData, width, height, depth, tex);
        cudaDeviceSynchronize();  // Ensure kernel execution is complete
    }

Based on the above code, the print statement within the kernel function generates the following message:

$ ./test
prev: 0.000000
prev: 0.000000
prev: 0.000000
prev: 0.000000
prev: 0.000000
Value at (2,2,2): 1.000000

We can see that each iteration fetches 0, which should not be the case after the first launch. As a result my final value is 1 instead of 5. Is there anything wrong? I believe there should be a valid way to bind a linear array to a texture and see the update in each iteration. Thanks everyone in advance.

When I read the description of tex3D() in the programming guide, I see:

fetches from the CUDA array specified by the three-dimensional texture object

There is no CUDA array in your example.

You might wish to compare that wording with the wording for the tex2D() function:

fetches from the CUDA array or the region of linear memory specified by the two-dimensional texture object

Do you spot the difference?

We can tell that things are fundamentally broken in your example: you already have a host data area set up with the inputData pointer. Initialize that region with float values of 1 (or any other non-zero value), copy it to d_inputData, and proceed with the rest of your example. tex3D still returns 0!

The simplePitchLinearTexture sample code may possibly be of interest (or not).

As an aside, the documentation also mentions:

cudaAddressModeWrap and cudaAddressModeMirror are only supported for normalized texture coordinates
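So with normalizedCoords = 0, as in your code, the wrap mode is not valid. A sketch of a texture descriptor that is consistent with integer coordinates (assuming the rest of your setup is unchanged) would be:

```cuda
    // With non-normalized (integer) coordinates, wrap/mirror are not
    // supported; clamp (or border) is the consistent choice.
    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.addressMode[2] = cudaAddressModeClamp;
    texDesc.filterMode = cudaFilterModePoint;
    texDesc.readMode = cudaReadModeElementType;
    texDesc.normalizedCoords = 0;   // integer texel coordinates
```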

Here is an example of something I rapidly hacked together out of the aforementioned sample code, showing your kernel (roughly) working in the 2D case:

# cat s.cu
#include <stdio.h>

#include <cuda_runtime.h>

#define checkCudaErrors(x) x

__global__ void kernelFunction(float* output, int width, int height, int depth, cudaTextureObject_t tex)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height)
    {
        if(x == 0 && y == 0){
            float value = tex2D<float>(tex, 0, 0);
            printf("prev: %f\n", value);
            output[y * width + x] = value + 1;
        }
    }
}





////////////////////////////////////////////////////////////////////////////////
// Declaration, forward
void runTest(int argc, char **argv);

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv) {
  runTest(argc, argv);
}

////////////////////////////////////////////////////////////////////////////////
//! Run a simple test for CUDA
////////////////////////////////////////////////////////////////////////////////
void runTest(int argc, char **argv) {
  // Set array size
  const int nx = 2048;
  const int ny = 2048;


  // Host allocation and initialization
  float *h_idata = (float *)malloc(sizeof(float) * nx * ny);

  for (int i = 0; i < nx * ny; ++i) {
    h_idata[i] = (float)i + 1;
  }

  // Device memory allocation
  // Pitch linear input data
  float *d_idataPL;
  size_t d_pitchBytes;

  checkCudaErrors(cudaMallocPitch((void **)&d_idataPL, &d_pitchBytes,
                                  nx * sizeof(float), ny));

  cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();


  // Pitch linear
  size_t h_pitchBytes = nx * sizeof(float);

  checkCudaErrors(cudaMemcpy2D(d_idataPL, d_pitchBytes, h_idata, h_pitchBytes,
                               nx * sizeof(float), ny, cudaMemcpyHostToDevice));


  cudaTextureObject_t texRefPL;
  cudaResourceDesc texRes;
  memset(&texRes, 0, sizeof(cudaResourceDesc));

  texRes.resType = cudaResourceTypePitch2D;
  texRes.res.pitch2D.devPtr = d_idataPL;
  texRes.res.pitch2D.desc = channelDesc;
  texRes.res.pitch2D.width = nx;
  texRes.res.pitch2D.height = ny;
  texRes.res.pitch2D.pitchInBytes = h_pitchBytes;
  cudaTextureDesc texDescr;
  memset(&texDescr, 0, sizeof(cudaTextureDesc));

  texDescr.normalizedCoords = true;
  texDescr.filterMode = cudaFilterModePoint;
  texDescr.addressMode[0] = cudaAddressModeWrap;
  texDescr.addressMode[1] = cudaAddressModeWrap;
  texDescr.readMode = cudaReadModeElementType;

  checkCudaErrors(cudaCreateTextureObject(&texRefPL, &texRes, &texDescr, NULL));
  for (int i = 0; i < 5; i++) {
    kernelFunction<<<1, dim3(8,8)>>>(d_idataPL, nx, ny, 1, texRefPL);
    cudaDeviceSynchronize();
  }
}
# nvcc -o s s.cu
# compute-sanitizer ./s
========= COMPUTE-SANITIZER
prev: 1.000000
prev: 2.000000
prev: 3.000000
prev: 4.000000
prev: 5.000000
========= ERROR SUMMARY: 0 errors
#

It’s not a trivial matter to correctly identify coordinates in the normalized-coordinate case, so you may wish to dispense with the wrap address mode and convert to non-normalized coordinates.


Thanks. Since tex3D does not work with linear memory, my goal is then to use a texture for faster reads while still writing to global memory. The way I envision it, I would create a cudaArray_t associated with a linear device pointer and create a cudaTextureObject_t from that array. In that case, I should be able to update the device pointer directly, as an ordinary global-memory write, and get the updated result in the next texture fetch. I wonder if that is possible. (Edit: I gave it a try and it does not seem to work. Is there really a way to support direct updates through global memory, rather than using a surface object?)
Then I guess I will use texture with surface. The issues I am having right now are degraded performance and mismatched results: it slows the entire program down by roughly 100x, and the result (when I use cudaMemcpy3DParms to copy back to the host) is wrong. Could you provide an explanation for why it becomes so slow after switching to a surface?

No, that won’t work. When you create a cudaArray, it represents its own opaque memory region and is not associated with any linear device pointer. Sure, you can copy to it, but that does not create an association with the linear memory you copied from; they are two separate entities.

If you want to stay with tex3D, your only option is texture plus surface; there is a CUDA sample code that shows how a (2D) texture and a surface can be used together.
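A rough sketch of that arrangement (not the actual sample code, just the idea): a texture object and a surface object are created over the same cudaArray, which must be allocated with cudaArraySurfaceLoadStore; the kernel reads through the texture and writes through the surface, so the next launch fetches the updated values. Error checking is omitted for brevity.

```cuda
#include <cstring>
#include <cuda_runtime.h>

// Read through the texture, write in place through the surface (2D case).
__global__ void stepKernel(cudaTextureObject_t tex, cudaSurfaceObject_t surf,
                           int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float v = tex2D<float>(tex, x, y);                  // texture read
        surf2Dwrite(v + 1.0f, surf, x * sizeof(float), y);  // x is in bytes
    }
}

void setup(int width, int height)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    // cudaArraySurfaceLoadStore is required to allow surface writes
    cudaMallocArray(&arr, &desc, width, height, cudaArraySurfaceLoadStore);

    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = arr;

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.filterMode = cudaFilterModePoint;
    texDesc.readMode = cudaReadModeElementType;
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.normalizedCoords = 0;

    cudaTextureObject_t tex;
    cudaSurfaceObject_t surf;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    cudaCreateSurfaceObject(&surf, &resDesc);
    // ...launch stepKernel repeatedly with both objects...
}
```

One caveat: texture fetches are only guaranteed to observe surface writes from previous kernel launches, not writes made earlier in the same launch, so the read-modify-write pattern here relies on the loop of separate launches.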

Any 3D memory region associated with a linear pointer could be handled in a 2D fashion the way I have already demonstrated.

I don’t have any other ideas.

