Using texture cache or L1 and L2 cache

tilo · November 24, 2010, 8:39pm

Hi

I have a decision to make and need some help with it. I can read my data through either the texture cache or the L1 and L2 caches of my Fermi card. Which is the preferred method when it comes to speed?

Which cache is faster?

My code will perform many gathers from scattered memory locations, so a cache should help.

eyalhir74 · November 25, 2010, 8:01am

Hi,
You’ll just need to try both (which is not so hard to do) and see which one works best.
Each application might behave differently. In my code the texture version ran much faster than the cached version,
and on top of that the shared memory was 48K instead of 16K.

eyal
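
A minimal sketch of the "try both" comparison, for anyone landing here later. It is not eyal's code. It assumes the texture object API (`cudaTextureObject_t`), which postdates this thread and needs compute capability 3.0 or newer; with a Fermi-era toolkit a texture reference bound with `cudaBindTexture` plays the same role. The helper name `compareGatherPaths` and the tile-shuffled index pattern are made up for illustration, and error checks are left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Gather through ordinary global loads (the L1/L2 path).
__global__ void gatherGlobal(const float* in, const int* idx, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[idx[i]];
}

// Same gather, but every read goes through the texture path.
__global__ void gatherTex(cudaTextureObject_t tex, const int* idx, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch<float>(tex, idx[i]);
}

// Hypothetical helper: times each kernel once (after a warm-up launch) and
// returns true when both produce the expected gather result.
bool compareGatherPaths(int n, float* msGlobal, float* msTex) {
    std::vector<float> hIn(n), hOutG(n), hOutT(n);
    std::vector<int> hIdx(n);
    for (int i = 0; i < n; ++i) {
        hIn[i] = static_cast<float>(i);
        // Assumed access pattern: shuffled inside 32-element tiles,
        // i.e. scattered but with some locality.
        hIdx[i] = ((i / 32) * 32 + (i * 17) % 32) % n;
    }

    float *dIn, *dOutG, *dOutT;
    int* dIdx;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOutG, n * sizeof(float));
    cudaMalloc(&dOutT, n * sizeof(float));
    cudaMalloc(&dIdx, n * sizeof(int));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dIdx, hIdx.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // 1D texture object over the same linear buffer, element-type reads.
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = dIn;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = n * sizeof(float);
    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;
    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    int block = 256, grid = (n + block - 1) / block;
    // Warm-up launches so one-time startup cost is not charged to either path.
    gatherGlobal<<<grid, block>>>(dIn, dIdx, dOutG, n);
    gatherTex<<<grid, block>>>(tex, dIdx, dOutT, n);

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);
    cudaEventRecord(t0);
    gatherGlobal<<<grid, block>>>(dIn, dIdx, dOutG, n);
    cudaEventRecord(t1);
    gatherTex<<<grid, block>>>(tex, dIdx, dOutT, n);
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);
    cudaEventElapsedTime(msGlobal, t0, t1);  // milliseconds
    cudaEventElapsedTime(msTex, t1, t2);

    cudaMemcpy(hOutG.data(), dOutG, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(hOutT.data(), dOutT, n * sizeof(float), cudaMemcpyDeviceToHost);
    bool ok = true;
    for (int i = 0; i < n; ++i)
        ok = ok && hOutG[i] == hIn[hIdx[i]] && hOutT[i] == hIn[hIdx[i]];

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);
    cudaDestroyTextureObject(tex);
    cudaFree(dIn);
    cudaFree(dOutG);
    cudaFree(dOutT);
    cudaFree(dIdx);
    return ok;
}
```

Calling `compareGatherPaths(1 << 20, &msGlobal, &msTex)` fills in the two timings. A single timed launch is only a rough indication; swap in the index pattern of the real algorithm and average over several runs before drawing conclusions.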

Debdatta · November 25, 2010, 12:21pm

For read-only access use 1D textures… they are cached for locality and will help if your access patterns are not completely scrambled…

tilo · November 25, 2010, 6:53pm

In my case I will read from different parts of memory, but with some locality. My algorithm does gathers with some locality, so I think the texture cache is a good fit.

Slightly off topic, but I need some direction on how to create a 2D texture of float4s.

I have this code. Will my cudaArray’s elements contain float4s?

#define TEX_WIDTH 256
#define TEX_HEIGHT 256

// each host row holds TEX_WIDTH texels of 4 floats (x, y, z, w)
float h_bottom[TEX_HEIGHT][TEX_WIDTH*4];
float h_top[TEX_HEIGHT][TEX_WIDTH*4];
float h_front[TEX_HEIGHT][TEX_WIDTH*4];
float h_back[TEX_HEIGHT][TEX_WIDTH*4];
float h_left[TEX_HEIGHT][TEX_WIDTH*4];
float h_right[TEX_HEIGHT][TEX_WIDTH*4];

//fill arrays with data

cudaStream_t stream[7];	// the stream handle type is cudaStream_t
for(int i = 0; i < 7; ++i)
	cudaStreamCreate(&stream[i]);

//perform a async memcpy from host data to cudaArray
cudaArray *d_bo, *d_t, *d_f, *d_ba, *d_l, *d_r;

// cudaMalloc3DArray takes a pointer to the channel descriptor
cudaMalloc3DArray(&d_bo, &channelDesc, extent);
cudaMalloc3DArray(&d_t, &channelDesc, extent);
cudaMalloc3DArray(&d_f, &channelDesc, extent);
cudaMalloc3DArray(&d_ba, &channelDesc, extent);
cudaMalloc3DArray(&d_l, &channelDesc, extent);
cudaMalloc3DArray(&d_r, &channelDesc, extent);

//channelDesc is {32, 32, 32, 32, cudaChannelFormatKindFloat} //is it ok to

// spitch and width are both given in bytes: one row of float4 texels
cudaMemcpy2DToArrayAsync(d_bo, 0, 0, h_bottom, sizeof(float)*4*extent.width, sizeof(float)*4*extent.width,
	extent.height, cudaMemcpyHostToDevice, stream[1]);
cudaMemcpy2DToArrayAsync(d_t, 0, 0, h_top, sizeof(float)*4*extent.width, sizeof(float)*4*extent.width,
	extent.height, cudaMemcpyHostToDevice, stream[2]);
cudaMemcpy2DToArrayAsync(d_f, 0, 0, h_front, sizeof(float)*4*extent.width, sizeof(float)*4*extent.width,
	extent.height, cudaMemcpyHostToDevice, stream[3]);
cudaMemcpy2DToArrayAsync(d_ba, 0, 0, h_back, sizeof(float)*4*extent.width, sizeof(float)*4*extent.width,
	extent.height, cudaMemcpyHostToDevice, stream[4]);
cudaMemcpy2DToArrayAsync(d_l, 0, 0, h_left, sizeof(float)*4*extent.width, sizeof(float)*4*extent.width,
	extent.height, cudaMemcpyHostToDevice, stream[5]);
cudaMemcpy2DToArrayAsync(d_r, 0, 0, h_right, sizeof(float)*4*extent.width, sizeof(float)*4*extent.width,
	extent.height, cudaMemcpyHostToDevice, stream[6]);

Do the elements of channelDesc have to match the Type given when the texture reference is declared:

texture<Type, Dim, ReadMode> texRef;

My texture reference has Type float4. Does my channelDesc structure have to have the w component set to 32?

Also, is my host array declaration OK for cudaMemcpy2DToArrayAsync with the given pitch? The spitch parameter should be the same as the width, am I correct?
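
One way to check these questions empirically is a round trip: fill a host buffer of float4 texels, copy it into a 2D cudaArray, read every texel back through a texture, and compare. The sketch below does that under a few assumptions that are not in the post above: it uses the texture object API (newer than this thread, compute capability 3.0+) instead of a `texture<float4, 2, cudaReadModeElementType>` reference, a synchronous `cudaMemcpy2DToArray` instead of the async variant, and `cudaMallocArray` for the 2D allocation. The names `readBack` and `float4TextureRoundTrip` are hypothetical, and error checks are omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TEX_WIDTH 256
#define TEX_HEIGHT 256

// Copy every texel out of the texture so the host can inspect it.
__global__ void readBack(cudaTextureObject_t tex, float4* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = tex2D<float4>(tex, x + 0.5f, y + 0.5f);  // texel centre
}

// Hypothetical helper: returns true when the descriptor has four 32-bit float
// channels and every float4 survives the host -> cudaArray -> texture trip.
bool float4TextureRoundTrip() {
    const int count = TEX_WIDTH * TEX_HEIGHT;
    std::vector<float4> hIn(count), hOut(count);
    for (int i = 0; i < count; ++i)
        hIn[i] = make_float4(i, i + 0.25f, i + 0.5f, i + 0.75f);

    // For float4 this yields x = y = z = w = 32 with the float format kind.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();
    bool ok = desc.x == 32 && desc.y == 32 && desc.z == 32 && desc.w == 32 &&
              desc.f == cudaChannelFormatKindFloat;

    cudaArray_t arr;
    cudaMallocArray(&arr, &desc, TEX_WIDTH, TEX_HEIGHT);

    // Host rows are tightly packed, so the source pitch equals the row width,
    // and both are expressed in bytes.
    size_t rowBytes = TEX_WIDTH * sizeof(float4);
    cudaMemcpy2DToArray(arr, 0, 0, hIn.data(), rowBytes, rowBytes, TEX_HEIGHT,
                        cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;
    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;     // no interpolation
    td.readMode = cudaReadModeElementType;   // return raw floats
    td.normalizedCoords = 0;
    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    float4* dOut;
    cudaMalloc(&dOut, count * sizeof(float4));
    dim3 block(16, 16);
    dim3 grid(TEX_WIDTH / block.x, TEX_HEIGHT / block.y);
    readBack<<<grid, block>>>(tex, dOut, TEX_WIDTH, TEX_HEIGHT);
    cudaMemcpy(hOut.data(), dOut, count * sizeof(float4), cudaMemcpyDeviceToHost);

    for (int i = 0; i < count; ++i)
        ok = ok && hOut[i].x == hIn[i].x && hOut[i].y == hIn[i].y &&
             hOut[i].z == hIn[i].z && hOut[i].w == hIn[i].w;

    cudaFree(dOut);
    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    return ok;
}
```

If `float4TextureRoundTrip()` returns true, the array elements are float4s and a pitch equal to the row width in bytes is correct for a tightly packed host array. For the async copies in the post, note that a host-to-device copy only overlaps with other work when the host buffer is page-locked (for example allocated with `cudaMallocHost`).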
