Output of 2D texture memory is zero

Hi,
I’m trting to bind a pointer in my device global memory to a texture memory so i can do 2D interpolation. However, when i load from the texture memory, everything is zero.
here is my code.

 #include <cuda_runtime.h>
 #include "device_launch_parameters.h"
 #include <stdio.h>
 #include "cuda.h"

texture<float, 2, cudaReadModeElementType> tex;
 #define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
 inline void gpuAssert(cudaError_t code, const char* file, int line, bool abort = true)
 {
if (code != cudaSuccess)
{
    fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
    if (abort) exit(code);
  }
 }
 __global__ void kernel_ArrivalTimeCalculation(float* Device_ConvArrivalTime) {
int TID = threadIdx.y * blockDim.x + threadIdx.x;
int BlockOFFset = blockDim.x * blockDim.y * blockIdx.x;
int GID_RowBased = BlockOFFset + TID;
int RowOFFset = blockDim.x * blockDim.y * gridDim.x * blockIdx.y;
int GID = RowOFFset + BlockOFFset + TID;
Device_ConvArrivalTime[GID_RowBased] = (float)(GID/2);
  }
 __global__ void kernel_ArrivalTimeCalculation_Show()  {
   int TID = threadIdx.y * blockDim.x + threadIdx.x;
     int BlockOFFset = blockDim.x * blockDim.y * blockIdx.x;
     int GID_RowBased = BlockOFFset + TID;
     int  Px_man = (GID_RowBased) % (256);
      int Pz_man = (GID_RowBased) / (256);
      float value;
        value = tex2D(tex, (float)(Px_man/3.0),  (float)(Pz_man/3.0) );
        printf("The value in the texture memory: %.6f \n", value);
         }

   int main()
    {
float* Device_ConvArrivalTime;  // Device pointer
int m = 512*96;
int n = 256;
size_t pitch, tex_ofs;
gpuErrchk(cudaMallocPitch((void**)&Device_ConvArrivalTime, &pitch, n * sizeof(float), m));
tex.normalized = false;
tex.filterMode = cudaFilterModeLinear;
gpuErrchk(cudaBindTexture2D(&tex_ofs, &tex, Device_ConvArrivalTime, &tex.channelDesc, n, m, pitch));
dim3 block(1024, 1);
dim3 grid((m * n / block.x), 1);
kernel_ArrivalTimeCalculation << <grid, block>> > (Device_ConvArrivalTime);
cudaDeviceSynchronize();
kernel_ArrivalTimeCalculation_Show << <1, 256>> > ();
cudaDeviceSynchronize();
gpuErrchk(cudaFree(Device_ConvArrivalTime));
cudaDeviceReset();
return 0;}

So, the problem is that i get 0 printed. what could be wrong here?

Moein.

Textures are designed for read-only access. So you need to store data in a data object, bind the texture to that data object, then retrieve data from texture.

Here, the data underlying the texture tex is stored in Device_ConvArrivalTime. You need to initialize that data before you read it out via texture. I don’t see any code that performs that initialization.

There may be other bugs in your code, I did not study it in detail.

Hi,
the “Device_ConvArrivalTime” is initialized in"kernel_ArrivalTimeCalculation " and then read out via texture in "kernel_ArrivalTimeCalculation_Show ". Is there something wrong?

Seems I got confused between kernel_ArrivalTimeCalculation and kernel_ArrivalTimeCalculation_Show when perusing the code …

You might want to debug in two steps:

(1) After kernel_ArrivalTimeCalculation, read back the data in Device_ConvArrivalTime without using texture to make sure it is as you expect.

(2) When you read back the texture, check carefully whether (a) the indexes are in-range so you don’t hit a clamp-to-border case (b) the texture indexing matches the indexing mode selected (normalized vs unnormalized) (c) make sure you hit the middle of each texel.

FWIW, the division by three in your tex2D calls looks unusual / suspicious to me. There should be several examples of tex2D usage in these forums. I know I have posted a few over the years. It may be best to start with a known-good example and extend it.

Note that old-style texture references are currently deprecated and will likely disappear with the next major CUDA release. If you are starting with textures now, it’s probably best to work with texture objects from the very start. For a quick introduction, checkout this post:

https://developer.nvidia.com/blog/cuda-pro-tip-kepler-texture-objects-improve-performance-and-flexibility/

yes, i’m starting to work with Texture memory. I have seen this post before. so, this is what is stated in this post:

  float *buffer;
  cudaMalloc(&buffer, N*sizeof(float));
   // create texture object
   cudaResourceDesc resDesc;
   memset(&resDesc, 0, sizeof(resDesc));
   resDesc.resType = cudaResourceTypeLinear;
   resDesc.res.linear.devPtr = buffer;
   resDesc.res.linear.desc.f = cudaChannelFormatKindFloat;
   resDesc.res.linear.desc.x = 32; // bits per channel
   resDesc.res.linear.sizeInBytes = N*sizeof(float);
   cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
   texDesc.readMode = cudaReadModeElementType;
    // create texture object: we only have to do this once!
    cudaTextureObject_t tex=0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

buffer is 1D, but my “Device_ConvArrivalTime” is 2D with m rows and n columns. could you please tell me what changes i need to apply to this object code? I have already used this object for 1D and it works fine, but never tried for a 2D case.

I have not used textures in years; I would have to consult the documentation just like you. There may be a worked example among the sample apps that ship with CUDA.

Can we not use the middle of texture? I’m saying this because I want to use the texture coordinates for normal addressing and if i want to do interpolation, then I would use the middle as well.

For unnormalized texture coordinates, you would add 0.5 to hit the middle of each texel. This is orthogonal to whatever you do with interpolation. There are worked examples of 2D texture interpolation using texture references in these forums.

Here is some code I just scraped from my hard disk:

#include <stdlib.h>
#include <stdio.h>

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)
// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaThreadSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

texture<unsigned char, 2, cudaReadModeNormalizedFloat> tex;

__global__ void kernel (int m, int n, float shift_x, float shift_y) 
{
    float val;
    for (int row = 0; row < m; row++) {
        for (int col = 0; col < n; col++) {
            val = 255.0 * tex2D (tex, col+0.5f+shift_x, row+0.5f+shift_y);
            printf ("%12.5f  ", val);
        }
        printf ("\n");
    }
}

int main (void)
{
    int m = 4; // height = #rows
    int n = 3; // width  = #columns
    size_t pitch, tex_ofs;
    unsigned char arr[4][3]= {{10,20,30},{40,50,60},{70,80,90},{100,110,120}};
    unsigned char *arr_d = 0;

    CUDA_SAFE_CALL(cudaMallocPitch((void**)&arr_d,&pitch,n*sizeof(*arr_d),m));
    CUDA_SAFE_CALL(cudaMemcpy2D(arr_d, pitch, arr, n*sizeof(arr[0][0]),
                                n*sizeof(arr[0][0]),m,cudaMemcpyHostToDevice));
    tex.normalized = false;
    tex.filterMode = cudaFilterModeLinear;
    CUDA_SAFE_CALL (cudaBindTexture2D (&tex_ofs, &tex, arr_d, &tex.channelDesc,
                                       n, m, pitch));
    if (tex_ofs !=0) {
        printf ("tex_ofs = %zu\n", tex_ofs);
        return EXIT_FAILURE;
    }
    printf ("reading array straight\n");
    kernel<<<1,1>>>(m, n, 0.0f, 0.0f);
    CHECK_LAUNCH_ERROR();
    CUDA_SAFE_CALL (cudaDeviceSynchronize());
    printf ("reading array shifted 0.5 in x-direction\n");
    kernel<<<1,1>>>(m, n, 0.5f, 0.0f);
    CHECK_LAUNCH_ERROR();
    CUDA_SAFE_CALL (cudaDeviceSynchronize());
    printf ("reading array shifted 0.5 in y-direction\n");
    kernel<<<1,1>>>(m, n, 0.0, -0.5f);
    CUDA_SAFE_CALL (cudaDeviceSynchronize());
    CUDA_SAFE_CALL (cudaFree (arr_d));
    return EXIT_SUCCESS;
}

Thank you for posting this. I also found this example, which explain how to do it with a 2D texture object: https://stackoverflow.com/questions/54098747/cuda-how-to-create-2d-texture-object

I used this exmple and modified my code accordingly.

#include <cuda_runtime.h>
 #include "device_launch_parameters.h"
#include <stdio.h>
#include "cuda.h"

 //texture<float, 2, cudaReadModeElementType> tex;

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

inline void gpuAssert(cudaError_t code, const char* file, int line, bool abort = true)
 {
if (code != cudaSuccess)
{
    fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
    if (abort) exit(code);
}
 }
  __global__ void kernel_ArrivalTimeCalculation(float* Device_ConvArrivalTime, int transmit, int size) {
int TID = threadIdx.y * blockDim.x + threadIdx.x;
int BlockOFFset = blockDim.x * blockDim.y * blockIdx.x;
int GID_RowBased = BlockOFFset + TID;
int RowOFFset = blockDim.x * blockDim.y * gridDim.x * blockIdx.y;
int GID = RowOFFset + BlockOFFset + TID;
Device_ConvArrivalTime[transmit * size + GID_RowBased] = (float)(transmit*20+GID/3)/10000;  // just to load the 
  Device_ConvArrivalTime
  }
 __global__ void kernel_ArrivalTimeCalculation_Show(float* Device_ConvArrivalTime, int transmit, int size, 
cudaTextureObject_t tex, int NumOFPixelX, int NumOFPixelZ, int NumOfSensor) {
int TID = threadIdx.y * blockDim.x + threadIdx.x;
int BlockOFFset = blockDim.x * blockDim.y * blockIdx.x;
int GID_RowBased = BlockOFFset + TID;
int  Px_man = (GID_RowBased) % (NumOFPixelX);
int Pz_man = (GID_RowBased) / (NumOFPixelX);
float value_Tex, value_Global;
value_Tex = tex2D<float>(tex, Px_man+0.5f, transmit*NumOFPixelZ+Pz_man + 0.5f);
value_Global = Device_ConvArrivalTime[transmit * size+ GID_RowBased];
printf("transmit: %d, value_Tex: %.6f, value_Global: %.6f \n", transmit,value_Tex, value_Global);
 }
int main() {
float* Device_ConvArrivalTime;  // Device pointer
cudaTextureObject_t tex;
int NumOfSensor = 96;
int NumOFPixelZ = 512;
int NumOFPixelX = 256;
int num_rows = NumOfSensor * NumOFPixelZ;
int num_cols = NumOFPixelX;
int devNo = 0;
cudaDeviceProp iProp;
cudaGetDeviceProperties(&iProp, devNo);
if (num_cols % iProp.texturePitchAlignment != 0) {
    printf("Improper number of columns. it should be a multiplication of  %lu \n", iProp.texturePitchAlignment);
}
size_t pitch;
gpuErrchk(cudaMallocPitch((void**)&Device_ConvArrivalTime, &pitch, num_cols * sizeof(float), num_rows));
struct cudaResourceDesc resDesc;
memset(&resDesc, 0, sizeof(resDesc));
resDesc.resType = cudaResourceTypePitch2D;
resDesc.res.pitch2D.devPtr = Device_ConvArrivalTime;
resDesc.res.pitch2D.width = num_cols;
resDesc.res.pitch2D.height = num_rows;
resDesc.res.pitch2D.desc = cudaCreateChannelDesc<float>();
resDesc.res.pitch2D.pitchInBytes = pitch;
//resDesc.res.pitch2D.pitchInBytes = num_cols * sizeof(float);
struct cudaTextureDesc texDesc;
memset(&texDesc, 0, sizeof(texDesc));
cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

dim3 block(1024, 1);
int size = NumOFPixelZ * NumOFPixelX;
dim3 grid((size / block.x), 1);
for (int transmit = 0; transmit < NumOfSensor; transmit++) { //NumOfSensor
    kernel_ArrivalTimeCalculation << <grid, block >> > (Device_ConvArrivalTime, transmit,size);
}
cudaDeviceSynchronize();

for (int transmit = 0; transmit < 2; transmit++) { //NumOfSensor
    kernel_ArrivalTimeCalculation_Show << <1, 20 >> > (Device_ConvArrivalTime, transmit, size, tex,NumOFPixelX,NumOFPixelZ, NumOfSensor);
}
cudaDeviceSynchronize();

gpuErrchk(cudaFree(Device_ConvArrivalTime));
cudaDeviceReset();

return 0;}

here is an image from the output:

So, for transmit=0, it works fine, but not for transmit=1. I double checked every indexing, but could not find which part is wrong. Any idea?

I think the prblem is with cudaMallocPitch. I treat with Device_ConvArrivalTime as a 1D memory, and the texture memory is 2D. So, maybe Device_ConvArrivalTime is not correctly mapped to the texture memory which causes this error?