1D texture cache

My memory is organized as a 2D array, and I only perform read operations on it, so using texture memory seems like a reasonable choice.

I access the memory in such a way that each block reads one full column or one full row into shared memory.

2D textures are known to be optimized for spatial locality.
My access pattern is local in the sense that adjacent threads access adjacent elements.
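For concreteness, here is a rough sketch of the pattern I mean, written against plain global memory (load_row, d_mat, and the WIDTH constant are just illustrative names, not from my real code):

#define WIDTH 256   // illustrative matrix width

// One block per row: thread t of block b reads element (b, t), so consecutive
// threads touch consecutive addresses in the x direction.
// (The column version would read d_mat[threadIdx.x * WIDTH + blockIdx.x] instead.)
__global__ void load_row(const float *d_mat, float *d_out)
{
	__shared__ float srow[WIDTH];
	srow[threadIdx.x] = d_mat[blockIdx.x * WIDTH + threadIdx.x];
	__syncthreads();
	// ... per-row work on srow goes here ...
	if (threadIdx.x == 0)
		d_out[blockIdx.x] = srow[0];
}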

BUT:
I assume the 2D texture cache is optimized for filtering tasks (e.g. when a texel is read, the texels surrounding it in both dimensions are brought into the cache as well). If that is the case, my reads from the texture will not use the cache efficiently.
Instead, I want that when I read a texel, the texels that are close to it in the x direction are fetched into the cache.

I thought about using two 1D arrays instead of one 2D array.
The first 1D array would store the data column-stacked (column-major), while the other would be row-stacked (row-major).
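A minimal sketch of the layout I have in mind (tex_rowstack, tex_colstack, and the fetch helpers are hypothetical names, assuming a HEIGHT x WIDTH matrix of floats):

texture<float, 1, cudaReadModeElementType> tex_rowstack;  // element (y, x) stored at index y * WIDTH + x
texture<float, 1, cudaReadModeElementType> tex_colstack;  // element (y, x) stored at index x * HEIGHT + y

// Walking along a row: consecutive x gives consecutive fetches from the row-stacked copy.
__device__ float fetch_along_x(int x, int y, int width)
{
	return tex1Dfetch(tex_rowstack, y * width + x);
}

// Walking down a column: consecutive y gives consecutive fetches from the column-stacked copy.
__device__ float fetch_along_y(int x, int y, int height)
{
	return tex1Dfetch(tex_colstack, x * height + y);
}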

Does this make sense, or am I wrong in assuming that a 1D texture is better cache-optimized for this task?

Another option is to use shared memory, but I need the whole 16 KB for a different computation, so I want to utilize the texture cache too.

Just write a tiny benchmark to test the read performance of a 2D texture going down a column. Then you will know for sure. Then benchmark the same type of read with a 1D texture, though that will make for ugly code with an if statement directing reads to two different textures.

I haven’t used 2D textures much, but I would guess you will get the same performance either way. In my experience, getting good performance out of a 1D texture requires that accesses within a warp have good spatial locality. Temporal locality matters little. I would assume the same for the 2D cache. After all, the 2D cache probably just reads tiles.

Code (note: the funny read into shared memory followed by a write from only one thread is there to force the dead-code optimizer to leave the texture read in; this is verified by reading the .ptx. I'm trying to minimize the number of memory writes so that the benchmark samples the texture read alone):

#include <stdio.h>

#define BLOCK_SIZE 256

texture<float4, 2, cudaReadModeElementType> tex;

// Each thread fetches tex2D(tex, blockIdx.x, threadIdx.x): within a block,
// threadIdx.x steps through the second (y) texture coordinate at a fixed x.
__global__ void testrow(float *d_out)
{
	__shared__ float sdata[BLOCK_SIZE * 4];

	float4 in = tex2D(tex, blockIdx.x, threadIdx.x);

	// Store all four components at distinct offsets so the compiler
	// cannot eliminate any part of the fetch.
	sdata[threadIdx.x]                  = in.x;
	sdata[threadIdx.x + BLOCK_SIZE]     = in.y;
	sdata[threadIdx.x + 2 * BLOCK_SIZE] = in.z;
	sdata[threadIdx.x + 3 * BLOCK_SIZE] = in.w;

	if (threadIdx.x == 0)
		*d_out = sdata[0];
}

// Each thread fetches tex2D(tex, threadIdx.x, blockIdx.x): within a block,
// threadIdx.x steps through the first (x) texture coordinate at a fixed y.
__global__ void testcol(float *d_out)
{
	__shared__ float sdata[BLOCK_SIZE * 4];

	float4 in = tex2D(tex, threadIdx.x, blockIdx.x);

	sdata[threadIdx.x]                  = in.x;
	sdata[threadIdx.x + BLOCK_SIZE]     = in.y;
	sdata[threadIdx.x + 2 * BLOCK_SIZE] = in.z;
	sdata[threadIdx.x + 3 * BLOCK_SIZE] = in.w;

	if (threadIdx.x == 0)
		*d_out = sdata[0];
}

texture<float4, 1, cudaReadModeElementType> tex_1d;

// Linear fetch from a 1D texture bound to ordinary device memory;
// consecutive threads read consecutive elements.
__global__ void test1Dtex(float *d_out)
{
	__shared__ float sdata[BLOCK_SIZE * 4];

	float4 in = tex1Dfetch(tex_1d, blockIdx.x * BLOCK_SIZE + threadIdx.x);

	sdata[threadIdx.x]                  = in.x;
	sdata[threadIdx.x + BLOCK_SIZE]     = in.y;
	sdata[threadIdx.x + 2 * BLOCK_SIZE] = in.z;
	sdata[threadIdx.x + 3 * BLOCK_SIZE] = in.w;

	if (threadIdx.x == 0)
		*d_out = sdata[0];
}

int main()
{
	float *d_out;
	cudaMalloc((void **)&d_out, sizeof(float));

	// 2D texture bound to a BLOCK_SIZE x BLOCK_SIZE cudaArray. The contents
	// are left uninitialized; only read throughput matters here.
	cudaArray *d_array;
	tex.addressMode[0] = cudaAddressModeClamp;
	tex.addressMode[1] = cudaAddressModeClamp;
	tex.filterMode = cudaFilterModePoint;
	tex.normalized = false;
	cudaMallocArray(&d_array, &tex.channelDesc, BLOCK_SIZE, BLOCK_SIZE);
	cudaBindTextureToArray(tex, d_array);

	for (int i = 0; i < 10; i++)
		testrow<<<BLOCK_SIZE, BLOCK_SIZE>>>(d_out);

	for (int i = 0; i < 10; i++)
		testcol<<<BLOCK_SIZE, BLOCK_SIZE>>>(d_out);

	// 1D texture bound to linear device memory.
	float4 *d_1dtex;
	cudaMalloc((void **)&d_1dtex, sizeof(float4) * BLOCK_SIZE * BLOCK_SIZE);
	cudaBindTexture(0, tex_1d, d_1dtex, sizeof(float4) * BLOCK_SIZE * BLOCK_SIZE);

	for (int i = 0; i < 10; i++)
		test1Dtex<<<BLOCK_SIZE, BLOCK_SIZE>>>(d_out);

	cudaFreeArray(d_array);
	cudaFree(d_1dtex);
	cudaFree(d_out);
	return 0;
}

Results:

method=[ testrow ] gputime=[ 17.120 ] cputime=[ 63.000 ] occupancy=[ 1.000 ]

method=[ testrow ] gputime=[ 16.640 ] cputime=[ 55.000 ] occupancy=[ 1.000 ]

method=[ testrow ] gputime=[ 16.736 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ testrow ] gputime=[ 16.384 ] cputime=[ 53.000 ] occupancy=[ 1.000 ]

method=[ testrow ] gputime=[ 16.640 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ testrow ] gputime=[ 16.736 ] cputime=[ 53.000 ] occupancy=[ 1.000 ]

method=[ testrow ] gputime=[ 16.576 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ testrow ] gputime=[ 16.736 ] cputime=[ 53.000 ] occupancy=[ 1.000 ]

method=[ testrow ] gputime=[ 16.576 ] cputime=[ 53.000 ] occupancy=[ 1.000 ]

method=[ testrow ] gputime=[ 16.672 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ testcol ] gputime=[ 19.424 ] cputime=[ 59.000 ] occupancy=[ 1.000 ]

method=[ testcol ] gputime=[ 19.712 ] cputime=[ 57.000 ] occupancy=[ 1.000 ]

method=[ testcol ] gputime=[ 19.360 ] cputime=[ 56.000 ] occupancy=[ 1.000 ]

method=[ testcol ] gputime=[ 19.200 ] cputime=[ 55.000 ] occupancy=[ 1.000 ]

method=[ testcol ] gputime=[ 19.616 ] cputime=[ 56.000 ] occupancy=[ 1.000 ]

method=[ testcol ] gputime=[ 19.168 ] cputime=[ 56.000 ] occupancy=[ 1.000 ]

method=[ testcol ] gputime=[ 19.264 ] cputime=[ 56.000 ] occupancy=[ 1.000 ]

method=[ testcol ] gputime=[ 19.904 ] cputime=[ 56.000 ] occupancy=[ 1.000 ]

method=[ testcol ] gputime=[ 19.168 ] cputime=[ 56.000 ] occupancy=[ 1.000 ]

method=[ testcol ] gputime=[ 19.360 ] cputime=[ 55.000 ] occupancy=[ 1.000 ]

method=[ test1Dtex ] gputime=[ 17.568 ] cputime=[ 59.000 ] occupancy=[ 1.000 ]

method=[ test1Dtex ] gputime=[ 16.864 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ test1Dtex ] gputime=[ 16.832 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ test1Dtex ] gputime=[ 16.736 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ test1Dtex ] gputime=[ 17.024 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ test1Dtex ] gputime=[ 16.800 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ test1Dtex ] gputime=[ 16.832 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ test1Dtex ] gputime=[ 17.120 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ test1Dtex ] gputime=[ 16.704 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

method=[ test1Dtex ] gputime=[ 16.896 ] cputime=[ 54.000 ] occupancy=[ 1.000 ]

It appears that accessing down a column of the 2D texture incurs a ~15% performance penalty. Of course, you might see a similar penalty from branching and reading from two different 1D textures.
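For what it's worth, the branched variant I have in mind might look something like this (untested sketch; tex_row1d and tex_col1d are hypothetical 1D textures holding the same data row-stacked and column-stacked, and BLOCK_SIZE is the same 256 as above):

texture<float4, 1, cudaReadModeElementType> tex_row1d;  // data stored row-stacked
texture<float4, 1, cudaReadModeElementType> tex_col1d;  // same data stored column-stacked

// readRows selects the copy whose storage order matches the direction this
// launch walks in, so the fetches stay consecutive either way. The flag is
// uniform across the launch, so the branch should not diverge, but you pay
// for the select and for keeping two copies of the data in memory.
__global__ void test1Dtex_branch(float *d_out, int readRows)
{
	__shared__ float sdata[BLOCK_SIZE * 4];

	int idx = blockIdx.x * BLOCK_SIZE + threadIdx.x;
	float4 in = readRows ? tex1Dfetch(tex_row1d, idx)
	                     : tex1Dfetch(tex_col1d, idx);

	sdata[threadIdx.x]                  = in.x;
	sdata[threadIdx.x + BLOCK_SIZE]     = in.y;
	sdata[threadIdx.x + 2 * BLOCK_SIZE] = in.z;
	sdata[threadIdx.x + 3 * BLOCK_SIZE] = in.w;

	if (threadIdx.x == 0)
		*d_out = sdata[0];
}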