Writing to texture memory using offset 2D texture memory, offsets

Hello,

I am using 2D texture memory on the GPU, and within the kernel running on the device I have a pointer (float *nextp) that has been bound to 2D texture memory.

As shown in the code snippet below, I would like to find offsets for cells (i,j), (i,j+1), (i+1,j), and (i+1,j+1). How would I determine these offsets, or is there a way to write to 2D texture memory in another way?

// coordinates

	 int i = threadIdx.x + blockIdx.x * blockDim.x;

	 int j = threadIdx.y + blockIdx.y * blockDim.y;

	 

	 int offset_11 = i + j * blockDim.x * gridDim.x; // (i,j)

	 int offset_12 = offset_11 + 1;                  // (i,j+1)

	 int offset_21 =                                 // (i+1,j)

	 int offset_22 =                                 // (i+1,j+1)

	

	 nextp[offset_11] = 1.0;

	 nextp[offset_12] = 1.0;

	 nextp[offset_21] = 1.0;

	 nextp[offset_22] = 1.0;

Hello,

I am using 2D texture memory on the GPU, and within the kernel running on the device I have a pointer (float *nextp) that has been bound to 2D texture memory.

As shown in the code snippet below, I would like to find offsets for cells (i,j), (i,j+1), (i+1,j), and (i+1,j+1). How would I determine these offsets, or is there a way to write to 2D texture memory in another way?

// coordinates

	 int i = threadIdx.x + blockIdx.x * blockDim.x;

	 int j = threadIdx.y + blockIdx.y * blockDim.y;

	 

	 int offset_11 = i + j * blockDim.x * gridDim.x; // (i,j)

	 int offset_12 = offset_11 + 1;                  // (i,j+1)

	 int offset_21 =                                 // (i+1,j)

	 int offset_22 =                                 // (i+1,j+1)

	

	 nextp[offset_11] = 1.0;

	 nextp[offset_12] = 1.0;

	 nextp[offset_21] = 1.0;

	 nextp[offset_22] = 1.0;

Using pitch linear textures, you can update the underlying memory as you described.
The offset for the next line in the texture(y → y+1) is pitch bytes then. (Look at the “pitchLinear” SDK example and cudaMallocPitch)

Other alternative: When the 2D-texture is bound to a cuda array, use a cuda surface to modify the array data(Cuda 3.0+).

Using pitch linear textures, you can update the underlying memory as you described.
The offset for the next line in the texture(y → y+1) is pitch bytes then. (Look at the “pitchLinear” SDK example and cudaMallocPitch)

Other alternative: When the 2D-texture is bound to a cuda array, use a cuda surface to modify the array data(Cuda 3.0+).

Thanks Nighthawk13; this is greatly appreciated! External Image

Thanks Nighthawk13; this is greatly appreciated! External Image

As suggested, I’ve been attempting to use a CUDA surface to modify the array data. I’ve bound the 2D texture to a CUDA array. But then how would I use a CUDA surface to modify the array? I’ve done the following on the host. I call the mallocMemory_alt() function:

// global variables

cudaArray *pnplus1;

texture<float,2,cudaReadModeElementType> pnplus1_GPU;

CUsurfref  pnplus1_ref;

void mallocMemory_alt(int Nx, int Ny)

{

	int ts = sizeof(float) * Nx * Ny;

	cudaChannelFormatDesc description = cudaCreateChannelDesc<float>();

	

	cudaMallocArray (  &pnplus1,

                        &description,

                        Nx, Ny ) );

	

	cuSurfRefSetArray(pnplus1_ref, pnplus1, 0);

	

	cudaBindTextureToArray(pnplus1_GPU, pnplus1, description) )	

	

} // end of function

But then using the CUDA surface, how do I update cudaArray *pnplus1 from within the kernel code? In the kernel code, I would like to assign some value to the {i,j} element in the pnplus1 array:

pnplus1[i][j] = 1.0;

Of course, I may have to use 1D indexing to access the array element, but the above code conveys a similar idea. Is there a good example on how to use a CUDA surface to access the CUDA array data from within the device kernel?

As suggested by Nighthawk13, pitch linear textures do indeed work as expected, and the simplePitchLinearTexture SDK example provides much insight. For code that runs on the device, the memory offsets are accessed in the following fashion:

int i = threadIdx.x + blockIdx.x * blockDim.x;

                int j = threadIdx.y + blockIdx.y * blockDim.y;

	        nextp[j*pitch + i] = tex2D( tex_p, i/(float)nx, j/(float)ny ); // ( i, j )

		nextp[j*pitch + i + 1] = tex2D( tex_p, i/(float)nx, j/(float)ny ); // ( i, j+1 )

		nextp[j*pitch + i + pitch] = tex2D( tex_p, i/(float)nx, j/(float)ny ); // ( i+1,j )

		nextp[j*pitch + i + pitch + 1] = tex2D( tex_p, i/(float)nx, j/(float)ny ); // ( i+1,j+1 )

As shown in the SDK example, normalized coordinates are used. In the above code snippet, the pitch is passed into the function that runs on the device, and the nx and the ny are the dimensions of the grid.

I’m not sure what you’re trying to do since all of 4 of your nextp’s are set to the same texture value. The compiler should optimize out the 4 texture reads anyway.

Are you sure your array indexing is what you think it is? You have (re-written for clarity)

        nextp[j*pitch + i]           // ( i, j )

	nextp[j*pitch + (i+1) ]      // ( i, j+1 ) ??

	nextp[(j+1)*pitch + i]       // ( i+1,j )  ??

	nextp[(j+1)*pitch + (i+1)]   // ( i+1,j+1 )

Yes, you are right; thank you for reviewing my code. I accidentally cut and pasted this code snippet from another file that I didn’t use. Here is the updated version showing how a 2D texture array can be accessed with tex2D() and the data assigned to pitch linear memory. This is a code snippet from the device kernel.

int i = threadIdx.x + blockIdx.x * blockDim.x;

int j = threadIdx.y + blockIdx.y * blockDim.y;

data[j*pitch + i] = tex2D( tex_p, i, j );                 // { i, j}

data[j*pitch + i + 1] = tex2D( tex_p, i, j+1 );           // { i, j+1}

data[j*pitch + i + pitch] = tex2D( tex_p, i+1, j );       // {i+1,j}

data[j*pitch + i + pitch + 1] = tex2D( tex_p, i+1, j+1 ); // {i+1,j+1}

How about those middle 2 assignments? It looks like you have i+1 and j+1 reversed.

data[jpitch + i + 1] is data[jpitch + (i + 1)] which is [i+1,j} not [i,j+1]

and

data[j*pitch + i + pitch] is data[(j+1)*pitch + i] which is [i,j+1} not [i+1,j]

Dittoaway, thank you very much for correcting me; I was wondering why I had to switch the indices in my calculations! You are very right about how the memory is accessed. So many thanks for your help! External Image

The updated version is as follows:

int i = threadIdx.x + blockIdx.x * blockDim.x;

int j = threadIdx.y + blockIdx.y * blockDim.y;

data[j*pitch + i] = tex2D( tex_p, i, j );                    // { i, j}

data[j*pitch + i + pitch] = tex2D( tex_p, i, j+1 );          // { i, j+1}

data[j*pitch + i + 1] = tex2D( tex_p, i+1, j );             // {i+1,j}

data[j*pitch + i + pitch + 1] = tex2D( tex_p, i+1, j+1 );   // {i+1,j+1}