Using texture cache or L1 and L2 cache

Hi

I have a decision to make and need some help with it. I can choose to read data through either the texture cache or the L1 and L2 caches of my Fermi card. Which is the preferred method when it comes to speed?

Which cache is faster?

In my code I will perform many gathers from scattered memory locations, so a cache should help.

Hi,
You’ll just need to try both (which is not hard to do) and see which one works best.
Each application might behave differently. In my code the texture version ran much faster than the L1/L2 cache version,
in addition to the fact that shared memory could then be 48K instead of 16K.

eyal
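
A minimal sketch of what "try both" could look like, assuming a Fermi-era toolkit where the legacy texture reference API is still available (it was removed in CUDA 12). The kernel names, the helper `compareGatherPaths`, the index pattern and the sizes are all made up for illustration and are not from anyone's real code. It times one gather through plain global loads (L1/L2) and one through `tex1Dfetch`, and uses `cudaFuncSetCacheConfig` to ask for the 48K L1 split for the first kernel and the 48K shared memory split for the second.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Legacy texture reference (file scope), bound to linear device memory below.
texture<float, 1, cudaReadModeElementType> texIn;

// Gather through the L1/L2 path: plain global loads.
__global__ void gatherGlobal(const float *in, const int *idx, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[idx[i]];
}

// Same gather through the texture cache.
__global__ void gatherTexture(const int *idx, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch(texIn, idx[i]);
}

// Hypothetical helper: times one launch of each kernel (in ms) and
// returns true if both paths produced identical results.
bool compareGatherPaths(int n, float *msGlobal, float *msTexture)
{
    std::vector<float> h_in(n), h_a(n), h_b(n);
    std::vector<int> h_idx(n);
    for (int i = 0; i < n; ++i) {
        h_in[i] = (float)i;
        h_idx[i] = (i + rand() % 64) % n;   // scattered reads with some locality
    }

    float *d_in, *d_a, *d_b;
    int *d_idx;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_idx, n * sizeof(int));
    cudaMemcpy(d_in, &h_in[0], n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_idx, &h_idx[0], n * sizeof(int), cudaMemcpyHostToDevice);
    cudaBindTexture(NULL, texIn, d_in, n * sizeof(float));

    // Global-load kernel: ask for 48K L1 / 16K shared.
    cudaFuncSetCacheConfig(gatherGlobal, cudaFuncCachePreferL1);
    // Texture kernel does not rely on L1: ask for 48K shared / 16K L1.
    cudaFuncSetCacheConfig(gatherTexture, cudaFuncCachePreferShared);

    cudaEvent_t e0, e1, e2;
    cudaEventCreate(&e0);
    cudaEventCreate(&e1);
    cudaEventCreate(&e2);
    int threads = 256, blocks = (n + threads - 1) / threads;

    cudaEventRecord(e0);
    gatherGlobal<<<blocks, threads>>>(d_in, d_idx, d_a, n);
    cudaEventRecord(e1);
    gatherTexture<<<blocks, threads>>>(d_idx, d_b, n);
    cudaEventRecord(e2);
    cudaEventSynchronize(e2);
    cudaEventElapsedTime(msGlobal, e0, e1);
    cudaEventElapsedTime(msTexture, e1, e2);

    cudaMemcpy(&h_a[0], d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_b[0], d_b, n * sizeof(float), cudaMemcpyDeviceToHost);
    bool same = true;
    for (int i = 0; i < n; ++i)
        if (h_a[i] != h_b[i]) same = false;

    cudaUnbindTexture(texIn);
    cudaEventDestroy(e0);
    cudaEventDestroy(e1);
    cudaEventDestroy(e2);
    cudaFree(d_in);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_idx);
    return same;
}

int main()
{
    float msGlobal = 0.0f, msTexture = 0.0f;
    bool same = compareGatherPaths(1 << 20, &msGlobal, &msTexture);
    printf("global: %.3f ms, texture: %.3f ms, results match: %s\n",
           msGlobal, msTexture, same ? "yes" : "no");
    return same ? 0 : 1;
}
```

A single launch of each kernel is only a rough indication; for a fair comparison it is probably worth adding a warm-up launch and averaging several runs, using the real access pattern of the application.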

For read-only access, use 1D textures… they are cached for locality and will help if your access patterns are not completely scrambled…

In my case I will read from different parts of memory, but with some locality. My algorithm performs gathers with some locality, so I think the texture cache is a good fit.

Slightly off topic, but I need some direction on how to create a 2D texture of float4s.

I have this code. Will my cudaArray’s elements contain float4s?

#define TEX_WIDTH 256
#define TEX_HEIGHT 256

float h_bottom[TEX_HEIGHT][TEX_WIDTH*4];
float h_top[TEX_HEIGHT][TEX_WIDTH*4];
float h_front[TEX_HEIGHT][TEX_WIDTH*4];
float h_back[TEX_HEIGHT][TEX_WIDTH*4];
float h_left[TEX_HEIGHT][TEX_WIDTH*4];
float h_right[TEX_HEIGHT][TEX_WIDTH*4];

//fill arrays with data

cudaStream_t stream[7];
for(int i = 0; i < 7; ++i)
	cudaStreamCreate(&stream[i]);

//perform an async memcpy from host data to cudaArray
cudaArray *d_bo, *d_t, *d_f, *d_ba, *d_l, *d_r;

cudaMalloc3DArray(&d_bo, &channelDesc, extent);
cudaMalloc3DArray(&d_t, &channelDesc, extent);
cudaMalloc3DArray(&d_f, &channelDesc, extent);
cudaMalloc3DArray(&d_ba, &channelDesc, extent);
cudaMalloc3DArray(&d_l, &channelDesc, extent);
cudaMalloc3DArray(&d_r, &channelDesc, extent);

//channelDesc is {32, 32, 32, 32, cudaChannelFormatKindFloat} //is it ok to

//spitch and width are both given in bytes
cudaMemcpy2DToArrayAsync(d_bo, 0, 0, h_bottom, sizeof(float)*4*extent.width, sizeof(float)*4*extent.width,
	extent.height, cudaMemcpyHostToDevice, stream[1]);
cudaMemcpy2DToArrayAsync(d_t, 0, 0, h_top, sizeof(float)*4*extent.width, sizeof(float)*4*extent.width,
	extent.height, cudaMemcpyHostToDevice, stream[2]);
cudaMemcpy2DToArrayAsync(d_f, 0, 0, h_front, sizeof(float)*4*extent.width, sizeof(float)*4*extent.width,
	extent.height, cudaMemcpyHostToDevice, stream[3]);
cudaMemcpy2DToArrayAsync(d_ba, 0, 0, h_back, sizeof(float)*4*extent.width, sizeof(float)*4*extent.width,
	extent.height, cudaMemcpyHostToDevice, stream[4]);
cudaMemcpy2DToArrayAsync(d_l, 0, 0, h_left, sizeof(float)*4*extent.width, sizeof(float)*4*extent.width,
	extent.height, cudaMemcpyHostToDevice, stream[5]);
cudaMemcpy2DToArrayAsync(d_r, 0, 0, h_right, sizeof(float)*4*extent.width, sizeof(float)*4*extent.width,
	extent.height, cudaMemcpyHostToDevice, stream[6]);

Do the channelDesc elements have to match the Type used when the texture reference is declared:

texture<Type, Dim, ReadMode> texRef;

My texture reference has Type float4. Does my channelDesc structure have to have the w component set to 32?

Also, is my declaration of the host arrays OK when I call cudaMemcpy2DToArrayAsync with the given pitch? The spitch parameter should be the same as the width, am I correct?
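
One way to check these questions is a small round trip: fill a host buffer, copy it into a cudaArray, read it back through the texture and compare. The sketch below assumes the legacy texture reference API (removed in CUDA 12). The helper `roundTripMismatches` and the kernel `copyFromTexture` are hypothetical names. It uses a flat `float4` host buffer, which has the same memory layout as `float[TEX_HEIGHT][TEX_WIDTH*4]`, and a synchronous copy to keep things short. `cudaCreateChannelDesc<float4>()` produces the descriptor {32, 32, 32, 32, cudaChannelFormatKindFloat}, so all four components, including w, are 32 bits. Because the host rows are tightly packed, spitch equals the row width in bytes.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TEX_WIDTH  256
#define TEX_HEIGHT 256

// Legacy texture reference; Type is float4, so the bound array is
// expected to hold four 32-bit float channels per element.
texture<float4, 2, cudaReadModeElementType> texRef;

// Read every texel back through the texture unit.
__global__ void copyFromTexture(float4 *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = tex2D(texRef, x + 0.5f, y + 0.5f);
}

// Hypothetical helper: returns the number of texels that did not
// round-trip (0 on success) or -1 if an allocation failed.
int roundTripMismatches()
{
    const int count = TEX_WIDTH * TEX_HEIGHT;
    std::vector<float4> h_in(count), h_out(count);
    for (int i = 0; i < count; ++i)
        h_in[i] = make_float4((float)i, i + 0.25f, i + 0.5f, i + 0.75f);

    // Same as {32, 32, 32, 32, cudaChannelFormatKindFloat}.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();
    cudaArray *d_arr = NULL;
    float4 *d_out = NULL;
    if (cudaMallocArray(&d_arr, &desc, TEX_WIDTH, TEX_HEIGHT) != cudaSuccess)
        return -1;
    if (cudaMalloc(&d_out, count * sizeof(float4)) != cudaSuccess) {
        cudaFreeArray(d_arr);
        return -1;
    }

    // spitch and width are both in bytes; rows are tightly packed, so they are equal.
    size_t rowBytes = TEX_WIDTH * sizeof(float4);
    cudaMemcpy2DToArray(d_arr, 0, 0, &h_in[0], rowBytes, rowBytes,
                        TEX_HEIGHT, cudaMemcpyHostToDevice);

    texRef.filterMode = cudaFilterModePoint;   // no interpolation
    texRef.normalized = false;                 // coordinates in texels
    cudaBindTextureToArray(texRef, d_arr, desc);

    dim3 block(16, 16);
    dim3 grid(TEX_WIDTH / block.x, TEX_HEIGHT / block.y);
    copyFromTexture<<<grid, block>>>(d_out, TEX_WIDTH, TEX_HEIGHT);
    cudaMemcpy(&h_out[0], d_out, count * sizeof(float4), cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < count; ++i)
        if (h_in[i].x != h_out[i].x || h_in[i].y != h_out[i].y ||
            h_in[i].z != h_out[i].z || h_in[i].w != h_out[i].w)
            ++mismatches;

    cudaUnbindTexture(texRef);
    cudaFree(d_out);
    cudaFreeArray(d_arr);
    return mismatches;
}

int main()
{
    int bad = roundTripMismatches();
    printf("mismatching texels: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

For the stream version, the same byte counts should carry over to cudaMemcpy2DToArrayAsync, but the host buffers would likely need to be page-locked (for example allocated with cudaMallocHost) for the copies to actually run asynchronously.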
