1D texture cache

My memory is organized as a 2D array, and I only perform read operations on it, so using texture memory seems like a reasonable choice.

I access the memory in such a way that each block reads one full column or one full row into shared memory.

2D textures are known to be optimized for spatial locality.
My access pattern is local in the sense that adjacent threads access adjacent elements.
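For concreteness, here is a rough sketch of the pattern I mean, written against plain global memory (load_row, d_mat, and the WIDTH constant are just illustrative names, not from my real code):

#define WIDTH 256   // illustrative matrix width

// One block per row: thread t of block b reads element (b, t), so consecutive
// threads touch consecutive addresses in the x direction.
// (The column version would read d_mat[threadIdx.x * WIDTH + blockIdx.x] instead.)
__global__ void load_row(const float *d_mat, float *d_out)
{
	__shared__ float srow[WIDTH];
	srow[threadIdx.x] = d_mat[blockIdx.x * WIDTH + threadIdx.x];
	__syncthreads();
	// ... per-row work on srow goes here ...
	if (threadIdx.x == 0)
		d_out[blockIdx.x] = srow[0];
}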

BUT:
I assume the 2D texture cache is optimized for filtering tasks (e.g. when a texel is read, the texels surrounding it in both dimensions are brought into the cache as well). If that is the case, my reads from the texture will not use the cache efficiently.
Instead, I want that when I read a texel, the texels that are close to it in the x direction are fetched into the cache.

I thought about using two 1D arrays instead of one 2D array.
The first 1D array would store the data column-stacked (column-major), while the other would be row-stacked (row-major).
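A minimal sketch of the layout I have in mind (tex_rowstack, tex_colstack, and the fetch helpers are hypothetical names, assuming a HEIGHT x WIDTH matrix of floats):

texture<float, 1, cudaReadModeElementType> tex_rowstack;  // element (y, x) stored at index y * WIDTH + x
texture<float, 1, cudaReadModeElementType> tex_colstack;  // element (y, x) stored at index x * HEIGHT + y

// Walking along a row: consecutive x gives consecutive fetches from the row-stacked copy.
__device__ float fetch_along_x(int x, int y, int width)
{
	return tex1Dfetch(tex_rowstack, y * width + x);
}

// Walking down a column: consecutive y gives consecutive fetches from the column-stacked copy.
__device__ float fetch_along_y(int x, int y, int height)
{
	return tex1Dfetch(tex_colstack, x * height + y);
}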

Does this make sense, or am I wrong in assuming that a 1D texture is better cache-optimized for this task?

Another option is to use shared memory, but I need the whole 16 KB for a different computation, so I want to utilize the texture cache too.

Just write a tiny benchmark to test the read performance of a 2D texture going down a column. Then you will know for sure. Then benchmark the same type of read with a 1D texture, though that will make for ugly code with an if statement directing reads to two different textures.

I haven’t used 2D textures much, but I would guess you will get the same performance either way. In my experience, getting good performance out of a 1D texture requires that accesses within a warp have good spatial locality. Temporal locality matters little. I would assume the same for the 2D cache. After all, the 2D cache probably just reads tiles.

Code (note: the funny read into shared memory followed by a write from only one thread is there to force the dead-code optimizer to leave the texture read in; this is verified by reading the .ptx. I'm trying to minimize the number of memory writes so that the benchmark samples the texture read alone):

#include <stdio.h>

#define BLOCK_SIZE 256

texture<float4, 2, cudaReadModeElementType> tex;

// Each thread fetches tex2D(tex, blockIdx.x, threadIdx.x): within a block,
// threadIdx.x steps through the second (y) texture coordinate at a fixed x.
__global__ void testrow(float *d_out)
{
	__shared__ float sdata[BLOCK_SIZE * 4];

	float4 in = tex2D(tex, blockIdx.x, threadIdx.x);

	// Store all four components at distinct offsets so the compiler
	// cannot eliminate any part of the fetch.
	sdata[threadIdx.x]                  = in.x;
	sdata[threadIdx.x + BLOCK_SIZE]     = in.y;
	sdata[threadIdx.x + 2 * BLOCK_SIZE] = in.z;
	sdata[threadIdx.x + 3 * BLOCK_SIZE] = in.w;

	if (threadIdx.x == 0)
		*d_out = sdata[0];
}

// Each thread fetches tex2D(tex, threadIdx.x, blockIdx.x): within a block,
// threadIdx.x steps through the first (x) texture coordinate at a fixed y.
__global__ void testcol(float *d_out)
{
	__shared__ float sdata[BLOCK_SIZE * 4];

	float4 in = tex2D(tex, threadIdx.x, blockIdx.x);

	sdata[threadIdx.x]                  = in.x;
	sdata[threadIdx.x + BLOCK_SIZE]     = in.y;
	sdata[threadIdx.x + 2 * BLOCK_SIZE] = in.z;
	sdata[threadIdx.x + 3 * BLOCK_SIZE] = in.w;

	if (threadIdx.x == 0)
		*d_out = sdata[0];
}

texture<float4, 1, cudaReadModeElementType> tex_1d;

// Linear fetch from a 1D texture bound to ordinary device memory;
// consecutive threads read consecutive elements.
__global__ void test1Dtex(float *d_out)
{
	__shared__ float sdata[BLOCK_SIZE * 4];

	float4 in = tex1Dfetch(tex_1d, blockIdx.x * BLOCK_SIZE + threadIdx.x);

	sdata[threadIdx.x]                  = in.x;
	sdata[threadIdx.x + BLOCK_SIZE]     = in.y;
	sdata[threadIdx.x + 2 * BLOCK_SIZE] = in.z;
	sdata[threadIdx.x + 3 * BLOCK_SIZE] = in.w;

	if (threadIdx.x == 0)
		*d_out = sdata[0];
}

int main()
{
	float *d_out;
	cudaMalloc((void **)&d_out, sizeof(float));

	// 2D texture bound to a BLOCK_SIZE x BLOCK_SIZE cudaArray. The contents
	// are left uninitialized; only read throughput matters here.
	cudaArray *d_array;
	tex.addressMode[0] = cudaAddressModeClamp;
	tex.addressMode[1] = cudaAddressModeClamp;
	tex.filterMode = cudaFilterModePoint;
	tex.normalized = false;
	cudaMallocArray(&d_array, &tex.channelDesc, BLOCK_SIZE, BLOCK_SIZE);
	cudaBindTextureToArray(tex, d_array);

	for (int i = 0; i < 10; i++)
		testrow<<<BLOCK_SIZE, BLOCK_SIZE>>>(d_out);

	for (int i = 0; i < 10; i++)
		testcol<<<BLOCK_SIZE, BLOCK_SIZE>>>(d_out);

	// 1D texture bound to linear device memory.
	float4 *d_1dtex;
	cudaMalloc((void **)&d_1dtex, sizeof(float4) * BLOCK_SIZE * BLOCK_SIZE);
	cudaBindTexture(0, tex_1d, d_1dtex, sizeof(float4) * BLOCK_SIZE * BLOCK_SIZE);

	for (int i = 0; i < 10; i++)
		test1Dtex<<<BLOCK_SIZE, BLOCK_SIZE>>>(d_out);

	cudaFreeArray(d_array);
	cudaFree(d_1dtex);
	cudaFree(d_out);
	return 0;
}

Results:

method=[ testrow ] gputime=[ 17.120 ] cputime=[ 63.000 ] occupancy=[ 1.000 ]

method=[ testrow ] gputime=[ 16.640 ] cputime=[ 55.000 ] occupancy=[ 1.000 ]

method=[ testrow ] gputime=[ 16.736 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ testrow ] gputime=[ 16.384 ] cputime=[ 53.000 ] occupancy=[ 1.000 ]

method=[ testrow ] gputime=[ 16.640 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ testrow ] gputime=[ 16.736 ] cputime=[ 53.000 ] occupancy=[ 1.000 ]

method=[ testrow ] gputime=[ 16.576 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ testrow ] gputime=[ 16.736 ] cputime=[ 53.000 ] occupancy=[ 1.000 ]

method=[ testrow ] gputime=[ 16.576 ] cputime=[ 53.000 ] occupancy=[ 1.000 ]

method=[ testrow ] gputime=[ 16.672 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ testcol ] gputime=[ 19.424 ] cputime=[ 59.000 ] occupancy=[ 1.000 ]

method=[ testcol ] gputime=[ 19.712 ] cputime=[ 57.000 ] occupancy=[ 1.000 ]

method=[ testcol ] gputime=[ 19.360 ] cputime=[ 56.000 ] occupancy=[ 1.000 ]

method=[ testcol ] gputime=[ 19.200 ] cputime=[ 55.000 ] occupancy=[ 1.000 ]

method=[ testcol ] gputime=[ 19.616 ] cputime=[ 56.000 ] occupancy=[ 1.000 ]

method=[ testcol ] gputime=[ 19.168 ] cputime=[ 56.000 ] occupancy=[ 1.000 ]

method=[ testcol ] gputime=[ 19.264 ] cputime=[ 56.000 ] occupancy=[ 1.000 ]

method=[ testcol ] gputime=[ 19.904 ] cputime=[ 56.000 ] occupancy=[ 1.000 ]

method=[ testcol ] gputime=[ 19.168 ] cputime=[ 56.000 ] occupancy=[ 1.000 ]

method=[ testcol ] gputime=[ 19.360 ] cputime=[ 55.000 ] occupancy=[ 1.000 ]

method=[ test1Dtex ] gputime=[ 17.568 ] cputime=[ 59.000 ] occupancy=[ 1.000 ]

method=[ test1Dtex ] gputime=[ 16.864 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ test1Dtex ] gputime=[ 16.832 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ test1Dtex ] gputime=[ 16.736 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ test1Dtex ] gputime=[ 17.024 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ test1Dtex ] gputime=[ 16.800 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ test1Dtex ] gputime=[ 16.832 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ test1Dtex ] gputime=[ 17.120 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ test1Dtex ] gputime=[ 16.704 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ test1Dtex ] gputime=[ 16.896 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

It appears that accessing down a column of the 2D texture incurs a ~15% performance penalty. Of course, you might see a similar penalty from branching and reading from two different 1D textures.
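For what it's worth, the branched variant I have in mind might look something like this (untested sketch; tex_row1d and tex_col1d are hypothetical 1D textures holding the same data row-stacked and column-stacked, and BLOCK_SIZE is the same 256 as above):

texture<float4, 1, cudaReadModeElementType> tex_row1d;  // data stored row-stacked
texture<float4, 1, cudaReadModeElementType> tex_col1d;  // same data stored column-stacked

// readRows selects the copy whose storage order matches the direction this
// launch walks in, so the fetches stay consecutive either way. The flag is
// uniform across the launch, so the branch should not diverge, but you pay
// for the select and for keeping two copies of the data in memory.
__global__ void test1Dtex_branch(float *d_out, int readRows)
{
	__shared__ float sdata[BLOCK_SIZE * 4];

	int idx = blockIdx.x * BLOCK_SIZE + threadIdx.x;
	float4 in = readRows ? tex1Dfetch(tex_row1d, idx)
	                     : tex1Dfetch(tex_col1d, idx);

	sdata[threadIdx.x]                  = in.x;
	sdata[threadIdx.x + BLOCK_SIZE]     = in.y;
	sdata[threadIdx.x + 2 * BLOCK_SIZE] = in.z;
	sdata[threadIdx.x + 3 * BLOCK_SIZE] = in.w;

	if (threadIdx.x == 0)
		*d_out = sdata[0];
}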