No performance improvement: shared mem vs global mem

Hello,
I am writing a matrix-matrix multiplication program for the GPU (for square matrices). I have two kernels: one reads the elements of the matrices directly from global memory, and the other uses shared memory to coalesce the reads of both input matrices.

__global__ void Kernel_global(float* d_C, float* d_A, float* d_B, int n){
  // blockDim.x is equal to n

  int i, j;                  // row (i) and column (j) of the output element
  int indexA, indexB, indexC;
  int k;

  float temp = 0.f;

  j = threadIdx.x;
  i = threadIdx.y;

  indexC = i*n + j;

  #pragma unroll 2
  for(k = 0; k < n; ++k){
    indexA = i*n + k;        // A(i,k), read directly from global memory
    indexB = k*n + j;        // B(k,j), read directly from global memory

    temp += d_A[indexA] * d_B[indexB];
  }

  d_C[indexC] = temp;
}

__global__ void Kernel_shared(float* d_C, float* d_A, float* d_B, int n){
  // blockDim.x is equal to n

  extern __shared__ float in_mats[];   // dynamic shared memory holding both input matrices

  float* s_A;
  float* s_B;

  int i, j;                  // row (i) and column (j) of the output element
  int indexA, indexB, indexC;
  int k;

  float temp = 0.f;

  j = threadIdx.x;
  i = threadIdx.y;

  s_A = in_mats;             // first n*n floats: matrix A
  s_B = in_mats + n*n;       // next n*n floats: matrix B

  // each thread copies one element of A and one element of B into shared memory
  s_A[i*blockDim.x + j] = d_A[i*blockDim.x + j];
  s_B[i*blockDim.x + j] = d_B[i*blockDim.x + j];

  indexC = i*n + j;

  __syncthreads();           // make sure both matrices are fully loaded

  #pragma unroll 2
  for(k = 0; k < n; ++k){
    indexA = i*n + k;
    indexB = k*n + j;

    temp += s_A[indexA] * s_B[indexB];
  }

  d_C[indexC] = temp;
}

The kernel is launched as a single block running on one SM, so the maximum matrix dimension is limited to 32x32 = 1024 threads.
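
For reference, the host-side launch looks roughly like the sketch below (my own addition, assuming n <= 32 and that the device buffers d_A, d_B, d_C are already allocated and filled):

// Hypothetical launch code, not part of the original kernels:
// one block of n x n threads; Kernel_shared additionally gets
// 2*n*n floats of dynamic shared memory for the two input matrices.
dim3 grid(1, 1);
dim3 block(n, n);

Kernel_global<<<grid, block>>>(d_C, d_A, d_B, n);
Kernel_shared<<<grid, block, 2 * n * n * sizeof(float)>>>(d_C, d_A, d_B, n);
cudaDeviceSynchronize();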

I am really surprised that both kernels have equal execution time on a GTX 460. I expected a speed-up due to the coalesced reading of matrix d_A. Can anyone explain this to me, please?

I’ll add to your question: I have a finite-difference kernel that, on compute capability 3.0, runs two times FASTER with NO shared memory than when using shared memory.

My guess is that perhaps the GPU’s L1 and L2 caching is better than our own shared memory management?

This isn’t completely unheard of. There are many possible reasons why this could happen; here’s an example:

Shared memory use limits SM occupancy. This means that if a kernel doesn’t use shared memory, it may be possible to run more warps concurrently on the SM. Sometimes this results in a speedup if the kernel is limited by occupancy, and global memory accesses don’t add too much overhead.
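
To put numbers on that for this thread (my own back-of-the-envelope arithmetic, assuming Fermi’s default 48 KB of shared memory per SM):

  shared memory per block = 2 * 32 * 32 * 4 B = 8 KB    (Kernel_shared, n = 32)
  resident blocks per SM  = 48 KB / 8 KB      = 6        (upper bound from shared memory alone)

Kernel_global has no such cap. In this particular single-block test neither kernel is occupancy-limited anyway, so the effect would only show up with larger launches.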

In Dalibor_CZ’s case, the input matrices should both fit in the L1, and the access pattern should make good use of the cache (A is broadcast to all threads in the warp, and B should align to a cache line).
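
For the sizes in question this is easy to verify (my own arithmetic, assuming n = 32):

  working set = 2 matrices * 32 * 32 * 4 B = 8 KB

which fits comfortably in the 16 KB (or 48 KB, depending on the cache configuration) of L1 on a Fermi SM, so repeated reads of A and B can be served from on-chip cache much as they would be from shared memory.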

For Fermi SM 2.1 (which includes the GF104 in the GTX 460), L1 and shared memory performance are identical. As Gregory mentioned, since the data fits in L1, you basically get the same speed as shared memory, with less programming hassle.
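
If you do rely on L1, it can also help to ask for the larger L1 partition; a minimal sketch (my addition, using the standard runtime call) would be:

// Prefer the 48 KB L1 / 16 KB shared split on Fermi for this kernel.
cudaFuncSetCacheConfig(Kernel_global, cudaFuncCachePreferL1);
Kernel_global<<<grid, block>>>(d_C, d_A, d_B, n);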

For the older Fermi SM 2.0, L1 is empirically about 40% slower than shared memory accesses, so it’s worth manually copying reused data to shared memory if memory bandwidth is a bottleneck.

But, as Gregory also pointed out, using shared memory will change your launch configuration and may prevent you from running as many blocks concurrently.

We swim in an ocean of performance tradeoff options!

Thanks for all replies.

To Gregory:
Just a few simple newbie questions. Data in the L1 cache are stored in lines, right? Is a cache line the equivalent of a bank in shared memory? After all, it is physically the same type of memory. As far as I know, there are 32 banks (4 bytes each) in SM 2.1. Therefore, as you pointed out, B should be aligned so that it causes no "bank conflicts" in the cache (or in shared memory, when it is stored there).

To SPWorley:
I did not know that cache performance in SM 2.0 differs so much from SM 2.1, thanks.

Many of the details of the Fermi L1 are described in the CUDA Programming Guide:

“A cache line is 128 bytes and maps to a 128 byte aligned segment in device memory.
Memory accesses that are cached in both L1 and L2 are serviced with 128-byte memory
transactions whereas memory accesses that are cached in L2 only are serviced with
32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for
example, in the case of scattered memory accesses.
If the size of the words accessed by each thread is more than 4 bytes, a memory
request by a warp is first split into separate 128-byte memory requests that are issued
independently.”

In your example, the accesses to B are aligned to a 128-byte cache line.
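
To spell that out for n = 32 (my own illustration of the access pattern in the inner loop):

  // within one warp, threadIdx.y (= i) is fixed and threadIdx.x (= j) runs from 0 to 31
  d_B[k*n + j]   // 32 consecutive floats = 128 bytes on a 128-byte boundary: one cache line
  d_A[i*n + k]   // the same address for all 32 threads of the warp: a single broadcast

So per loop iteration the plain global-memory kernel generates essentially the same memory traffic as the shared-memory version, which is consistent with the identical timings.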