Texture and L1 memory bandwidth

Twz · December 9, 2011, 12:41am

Hi all,

I am a little confused regarding the Fermi Tuning guide. Here is the paragraph that confuses me:

If I understand correctly global memory uses L1 cache and then can be faster than texture memory. Reading some others posts on the subject it is not always the case.

I work with a NVIDIA QUADRO 600 (arch. 2.1) which has a maximum memory bandwidth of 25.6GB/s (cf. http://en.wikipedia…ng_units#Quadro).

Is this bandwidth correspond to the L1 memory bandwidth or is it something different? Also would it be a way to know the theoretical texture memory bandwidth attainable?

Using the Pitch Linear Texture program from the NVIDIA SDK the bandwidth obtained for a pitch linear memory is 7.92 GB/s. It is quite a drop from the maximum possible on my card. I suppose this is the kind of limitation they were talking about in this paragraph.

I have a kernel that is doing a reduction on blocks of 16x16 elements on an array of 1024x1024 int (its equivalent to a sum of 256 elements). It seems to be faster than when I use texture fetches. Here is the kernel code below:

__global__ void BlockReductionKernel(int* pIn, size_t pitch_In_el, int* pOut, size_t pitch_Out_el)

{

 	uint i = blockIdx.x*16 + threadIdx.x; // Row index

	uint j = blockIdx.y*16 + threadIdx.y; // Column index

	uint tid = threadIdx.x*blockDim.y + threadIdx.y; // Thread index

	__shared__ int smem[64]; 

	// Load into shared memory

 	//smem[tid] = pIn[i*pitch_Pyr_el + j] + pIn [(i+4)*pitch_Pyr_el + j] + pIn [(i+8)*pitch_Pyr_el + j] + pIn[(i+12)*pitch_Pyr_el +j];

 	smem[tid] = tex2D(pInTex, j + 0.5f, i + 0.5f) + tex2D(pInTex, j + 0.5f, i + 4 + 0.5f) + tex2D(pInTex, j + 0.5f, i + 8 + 0.5f) + tex2D(pInTex, j + 0.5f, i + 12 + 0.5f);	// Sum of 4 elements within the block of 16x16

	__syncthreads(); 

	if(tid<16) smem[tid] += smem[tid + 16] + smem[tid + 32] + smem[tid + 48];

	if(tid==0)

 	{

 		#pragma unroll

 		for(uint x = 1 ; x<16 ; ++x)

 			smem[0] += smem[x];

		pOut[blockIdx.x*pitch_Out_el + blockIdx.y] = smem[0];

 	}

}

The kernel is launched that way:

BlockReductionKernel<<<dim3(1024/16, 1024/16, 1), dim3(4, 16, 1)>>>(pIn, pitch_In_el, pOut, pitch_In_el);

This is the fastest code I could wrote and I get a bandwidth of 9.321 GB/s (450 us for an array of 1024x1024 pixels). Itâ€™s far from the performance obtained with the reduction kernel from the SDK but itâ€™s not doing a reduction on the whole array. I need to keep a size of block of 16x16 elements for my application. I suppose in my case that the performance is limited by the texture bandwidth. Am I right?

So would it be possible to have a better bandwidth in my case using instead global memory ?

Thanks in advance for your help.

Twz · December 12, 2011, 6:31pm

Nobody ?

tera · December 12, 2011, 8:02pm

Why are you using a texture at all? This task does not benefit from caching.

Read the data through an int4 pointer, and produce multiple outputs per block. Something like (completely untested!)

#define BLOCKSIZE 64

__global__ void BlockReductionKernel(int4* pIn, int pitch_In_el, int* pOut, int pitch_Out_el)

{

        int tid = threadIdx.x;

        pIn += blockIdx.x*BLOCKSIZE + tid; // select column

        pIn += blockIdx.y*16*pitch_In_el;  // select row

__shared__ int smem[BLOCKSIZE];

        int sum = 0;

#pragma unroll

        for (int k=0; k<16; k++) {

                sum += pIn->x + pIn->y + pIn->z + pIn->w;

                pIn += pitch_In_el;

        }

smem[threadIdx.x] = sum;

        __syncthreads(); 

if (threadIdx.x < BLOCKSIZE/4)

                pOut[blockIdx.y * pitch_Out_el + (BLOCKSIZE/4)*blockIdx.x + tid] = smem[4*tid] + smem[4*tid+1] + smem[4*tid+2] + smem[4*tid+3];

}

and launch it like

BlockReductionKernel<<<dim3(1024/(4*BLOCKSIZE), 1024/16, 1), dim3(BLOCKSIZE, 1, 1)>>>((int4 *)pIn, pitch_In_el/4, pOut, pitch_In_el);

EDIT: Coalesce all writes on CC 1.0 and 1.1

Twz · December 13, 2011, 12:06am

Thanks tera.

The code you have provided is more than twice as fast as mine. I have an execution time of 210 us (which correspond to a bandwidth of 19.979 GB/s) (I updated my previous timing, I gave the wrong one initially). I really learned something today. Many thanks.

Also to answer to your question I used the texture memory because it was faster in my case.

By the way there is a small mistake in your code. You have to replace the last line :

pOut[blockIdx.y * pitch_Out_el + 8*blockIdx.x + tid] = smem[4*tid] + smem[4*tid+1] + smem[4*tid+2] + smem[4*tid+3];

by this one

pOut[blockIdx.y * pitch_Out_el + 16*blockIdx.x + tid] = smem[4*tid] + smem[4*tid+1] + smem[4*tid+2] + smem[4*tid+3];

Have a great day.

tera · December 13, 2011, 12:23am

Oh yes, I missed that line when I edited the post. Now corrected.

Twz · December 13, 2011, 7:17pm

Hi tera,

I would like some clarifications about the code you posted. I was thinking about it this morning and everything went blurry :) In the Fermi architecture, the L1 cache line is 128 bytes and each wrap contains 32 threads. That means that for a coalesced read each thread has to read 4 bytes (or one int in our case). But here each thread reads 16 bytes (uint4). Does it means that your code is not coalesced ?

Your algorithm can processed 16 blocks (16x16 int) with 64 threads. So each wrap need a memory segment of 512 bytes (along the horizontal dimension). That means that there is 4 memory transactions of 128 bytes each for each wrap. Am I right ?

Thanks.

DrAnderson42 · December 13, 2011, 7:50pm

No, the accesses are still coalesced. Each warp just loads 4 cache lines.

tera · December 13, 2011, 8:03pm

You are right that each read will result in four 128-byte transactions per warp. However, because these transactions are generated in direct succession, the transactions arriving at each memory controller will more often go to an already open row (sometimes also called page) . As opening a new row requires an extra cycle on the memory bus, these (partially) ordered read requests complete in fewer cycles overall.

Twz · December 13, 2011, 9:11pm

Ok so can we still call that a coalesced memory access as DrAnderson42 said ?

tera · December 13, 2011, 9:27pm

Of course. Dr. Anderson is a highly regarded member of this forum!

Twz · December 13, 2011, 10:03pm

I didn’t said he was wrong External Image In my point of view I was wondering if the speed of the algorithm (knowing that the access was not coalesced) was only because the L1 cache was used with the correct load granularity (128 bytes segments).

So the access is coalesced. Is it because each thread accesses four consecutive integers and the memory location stays within the same cache line ? For instance if each thread reads 8 consecutive integers would the memory access still be coalesced (8 caches lines for 32 threads) ?

tera · December 13, 2011, 10:23pm

There is no machine instruction to read 8 integers at once, so such a read would not be coalesced.

On Fermi, coalescing however is of a lot less importance than on previous architectures (which probably is why the Programming Guide does not use this term with respect to the Fermi architecture). In the case of an 8-integer read, the compiler would probably generate two 4-integer reads. The first one would have to read the whole 256 bytes (including the gaps). The second read would likely be served from the cache. Overall, this might come close to the performance of two coalesced 4-integer reads. Only experiments will tell.

Twz · December 13, 2011, 10:57pm

It explains a lot of things. Thank you tera. I’ve to stop worrying about coalescing for Fermi architecture then. So If I understand correctly, if I want the best performance on Fermi, my data must be read consecutively using either 32, 64 or 128 bits words and my data should fit in a multiple of 128 bytes cache line. It explains also why the kernel I posted takes twice as long as yours, only half of the memory on a cache line was used: 16 ints * 4 bytes per int = 64 bytes.

DrAnderson42 · December 14, 2011, 1:22am

You may also find this webinar interesting:

Particularly, the first slide with a graph which shows performance differences between 32, 64, and 128 bit loads. Note that it assumes 2 accesses (load and store) per thread - so as the slide says: “Several independent smaller accesses have the same effect as one larger one.” The take-away message is that unlike on earlier generations, the size of the memory transaction doesn’t matter so much on Fermi. But you do need to make sure that you are using all 128 bytes loaded from each cache line, or you lose potential performance.

If you cannot arrange your computation to effectively use the entire cache line, try tex1Dfetch - you may get better performance out of texture cache.

Twz · December 14, 2011, 4:16pm

Thanks. This is very informative. External Image

Topic		Replies	Views
Why texture/constant memory under FERMI architecture CUDA Programming and Performance	23	4021	November 3, 2010
Texture Memory vs. Global Memory and float4 CUDA Programming and Performance	5	1837	November 1, 2010
Why texture memory is better on Fermi? CUDA Programming and Performance	62	20829	January 28, 2011
I am trying to compare the performance of texture fetch and usual memory fetch CUDA Programming and Performance	10	2255	July 19, 2010
Question on the L1 caching of the GK 110 CUDA Programming and Performance	17	7143	April 17, 2013
Memory coalescing in one thread CUDA Programming and Performance	17	16591	March 31, 2011
Texture vs Global Memory Bandwidth CUDA Programming and Performance	5	6559	March 25, 2010
Suggestion for fermi: kill shared memory bank conflicts + coalescing CUDA Programming and Performance	8	10484	January 24, 2010
GTX 470 performance gains too low ? (texture operations) CUDA Programming and Performance	16	10941	April 22, 2010
what's the benefit of using texture memory in Fermi verus using global memory CUDA Programming and Performance	12	2788	August 9, 2010

Texture and L1 memory bandwidth

Related topics