Texture and L1 memory bandwidth

Hi all,

I am a little confused regarding the Fermi Tuning guide. Here is the paragraph that confuses me:

If I understand correctly global memory uses L1 cache and then can be faster than texture memory. Reading some others posts on the subject it is not always the case.

I work with a NVIDIA QUADRO 600 (arch. 2.1) which has a maximum memory bandwidth of 25.6GB/s (cf. http://en.wikipedia…ng_units#Quadro).

Is this bandwidth correspond to the L1 memory bandwidth or is it something different? Also would it be a way to know the theoretical texture memory bandwidth attainable?

Using the Pitch Linear Texture program from the NVIDIA SDK the bandwidth obtained for a pitch linear memory is 7.92 GB/s. It is quite a drop from the maximum possible on my card. I suppose this is the kind of limitation they were talking about in this paragraph.

I have a kernel that is doing a reduction on blocks of 16x16 elements on an array of 1024x1024 int (its equivalent to a sum of 256 elements). It seems to be faster than when I use texture fetches. Here is the kernel code below:

__global__ void BlockReductionKernel(int* pIn, size_t pitch_In_el, int* pOut, size_t pitch_Out_el)

{

 	uint i = blockIdx.x*16 + threadIdx.x; // Row index

	uint j = blockIdx.y*16 + threadIdx.y; // Column index

	uint tid = threadIdx.x*blockDim.y + threadIdx.y; // Thread index

	__shared__ int smem[64]; 

	// Load into shared memory

 	//smem[tid] = pIn[i*pitch_Pyr_el + j] + pIn [(i+4)*pitch_Pyr_el + j] + pIn [(i+8)*pitch_Pyr_el + j] + pIn[(i+12)*pitch_Pyr_el +j];

 	smem[tid] = tex2D(pInTex, j + 0.5f, i + 0.5f) + tex2D(pInTex, j + 0.5f, i + 4 + 0.5f) + tex2D(pInTex, j + 0.5f, i + 8 + 0.5f) + tex2D(pInTex, j + 0.5f, i + 12 + 0.5f);	// Sum of 4 elements within the block of 16x16

	__syncthreads(); 

	if(tid<16) smem[tid] += smem[tid + 16] + smem[tid + 32] + smem[tid + 48];

	if(tid==0)

 	{

 		#pragma unroll

 		for(uint x = 1 ; x<16 ; ++x)

 			smem[0] += smem[x];

		pOut[blockIdx.x*pitch_Out_el + blockIdx.y] = smem[0];

 	}

}

The kernel is launched that way:

BlockReductionKernel<<<dim3(1024/16, 1024/16, 1), dim3(4, 16, 1)>>>(pIn, pitch_In_el, pOut, pitch_In_el);

This is the fastest code I could wrote and I get a bandwidth of 9.321 GB/s (450 us for an array of 1024x1024 pixels). It’s far from the performance obtained with the reduction kernel from the SDK but it’s not doing a reduction on the whole array. I need to keep a size of block of 16x16 elements for my application. I suppose in my case that the performance is limited by the texture bandwidth. Am I right?

So would it be possible to have a better bandwidth in my case using instead global memory ?

Thanks in advance for your help.

Nobody ?

Why are you using a texture at all? This task does not benefit from caching.

Read the data through an int4 pointer, and produce multiple outputs per block. Something like (completely untested!)

#define BLOCKSIZE 64

__global__ void BlockReductionKernel(int4* pIn, int pitch_In_el, int* pOut, int pitch_Out_el)

{

        int tid = threadIdx.x;

        pIn += blockIdx.x*BLOCKSIZE + tid; // select column

        pIn += blockIdx.y*16*pitch_In_el;  // select row

__shared__ int smem[BLOCKSIZE];

        int sum = 0;

#pragma unroll

        for (int k=0; k<16; k++) {

                sum += pIn->x + pIn->y + pIn->z + pIn->w;

                pIn += pitch_In_el;

        }

smem[threadIdx.x] = sum;

        __syncthreads(); 

if (threadIdx.x < BLOCKSIZE/4)

                pOut[blockIdx.y * pitch_Out_el + (BLOCKSIZE/4)*blockIdx.x + tid] = smem[4*tid] + smem[4*tid+1] + smem[4*tid+2] + smem[4*tid+3];

}

and launch it like

BlockReductionKernel<<<dim3(1024/(4*BLOCKSIZE), 1024/16, 1), dim3(BLOCKSIZE, 1, 1)>>>((int4 *)pIn, pitch_In_el/4, pOut, pitch_In_el);

EDIT: Coalesce all writes on CC 1.0 and 1.1

Thanks tera.

The code you have provided is more than twice as fast as mine. I have an execution time of 210 us (which correspond to a bandwidth of 19.979 GB/s) (I updated my previous timing, I gave the wrong one initially). I really learned something today. Many thanks.

Also to answer to your question I used the texture memory because it was faster in my case.

By the way there is a small mistake in your code. You have to replace the last line :

pOut[blockIdx.y * pitch_Out_el + 8*blockIdx.x + tid] = smem[4*tid] + smem[4*tid+1] + smem[4*tid+2] + smem[4*tid+3];

by this one

pOut[blockIdx.y * pitch_Out_el + 16*blockIdx.x + tid] = smem[4*tid] + smem[4*tid+1] + smem[4*tid+2] + smem[4*tid+3];

Have a great day.

Oh yes, I missed that line when I edited the post. Now corrected.

Hi tera,

I would like some clarifications about the code you posted. I was thinking about it this morning and everything went blurry :) In the Fermi architecture, the L1 cache line is 128 bytes and each wrap contains 32 threads. That means that for a coalesced read each thread has to read 4 bytes (or one int in our case). But here each thread reads 16 bytes (uint4). Does it means that your code is not coalesced ?

Your algorithm can processed 16 blocks (16x16 int) with 64 threads. So each wrap need a memory segment of 512 bytes (along the horizontal dimension). That means that there is 4 memory transactions of 128 bytes each for each wrap. Am I right ?

Thanks.

No, the accesses are still coalesced. Each warp just loads 4 cache lines.

You are right that each read will result in four 128-byte transactions per warp. However, because these transactions are generated in direct succession, the transactions arriving at each memory controller will more often go to an already open row (sometimes also called page) . As opening a new row requires an extra cycle on the memory bus, these (partially) ordered read requests complete in fewer cycles overall.

Ok so can we still call that a coalesced memory access as DrAnderson42 said ?

Of course. Dr. Anderson is a highly regarded member of this forum!

I didn’t said he was wrong :smile: In my point of view I was wondering if the speed of the algorithm (knowing that the access was not coalesced) was only because the L1 cache was used with the correct load granularity (128 bytes segments).

So the access is coalesced. Is it because each thread accesses four consecutive integers and the memory location stays within the same cache line ? For instance if each thread reads 8 consecutive integers would the memory access still be coalesced (8 caches lines for 32 threads) ?

There is no machine instruction to read 8 integers at once, so such a read would not be coalesced.

On Fermi, coalescing however is of a lot less importance than on previous architectures (which probably is why the Programming Guide does not use this term with respect to the Fermi architecture). In the case of an 8-integer read, the compiler would probably generate two 4-integer reads. The first one would have to read the whole 256 bytes (including the gaps). The second read would likely be served from the cache. Overall, this might come close to the performance of two coalesced 4-integer reads. Only experiments will tell.

It explains a lot of things. Thank you tera. I’ve to stop worrying about coalescing for Fermi architecture then. So If I understand correctly, if I want the best performance on Fermi, my data must be read consecutively using either 32, 64 or 128 bits words and my data should fit in a multiple of 128 bytes cache line. It explains also why the kernel I posted takes twice as long as yours, only half of the memory on a cache line was used: 16 ints * 4 bytes per int = 64 bytes.

You may also find this webinar interesting:
http://developer.download.nvidia.com/CUDA/training/bandwidthlimitedkernels_webinar.pdf

Particularly, the first slide with a graph which shows performance differences between 32, 64, and 128 bit loads. Note that it assumes 2 accesses (load and store) per thread - so as the slide says: “Several independent smaller accesses have the same effect as one larger one.” The take-away message is that unlike on earlier generations, the size of the memory transaction doesn’t matter so much on Fermi. But you do need to make sure that you are using all 128 bytes loaded from each cache line, or you lose potential performance.

If you cannot arrange your computation to effectively use the entire cache line, try tex1Dfetch - you may get better performance out of texture cache.

Thanks. This is very informative. :smile: