No performance improvement: shared mem vs global mem

Hello,
I am writing a matrix-matrix multiplication program for the GPU (for square matrices). I have two kernels: one reads the elements of the matrices directly from global memory, and the other uses shared memory to coalesce the reads of both input matrices.

__global__ void Kernel_global(float* d_C, float* d_A, float* d_B, int n){
  // blockDim.x is equal to n

  int i, j;                  // row (i) and column (j) of the output element
  int indexA, indexB, indexC;
  int k;

  float temp = 0.f;

  j = threadIdx.x;
  i = threadIdx.y;

  indexC = i*n + j;

  #pragma unroll 2
  for(k = 0; k < n; ++k){
    indexA = i*n + k;        // A(i,k), read directly from global memory
    indexB = k*n + j;        // B(k,j), read directly from global memory

    temp += d_A[indexA] * d_B[indexB];
  }

  d_C[indexC] = temp;
}

__global__ void Kernel_shared(float* d_C, float* d_A, float* d_B, int n){
  // blockDim.x is equal to n

  extern __shared__ float in_mats[];   // dynamic shared memory holding both input matrices

  float* s_A;
  float* s_B;

  int i, j;                  // row (i) and column (j) of the output element
  int indexA, indexB, indexC;
  int k;

  float temp = 0.f;

  j = threadIdx.x;
  i = threadIdx.y;

  s_A = in_mats;             // first n*n floats: matrix A
  s_B = in_mats + n*n;       // next n*n floats: matrix B

  // each thread copies one element of A and one element of B into shared memory
  s_A[i*blockDim.x + j] = d_A[i*blockDim.x + j];
  s_B[i*blockDim.x + j] = d_B[i*blockDim.x + j];

  indexC = i*n + j;

  __syncthreads();           // make sure both matrices are fully loaded

  #pragma unroll 2
  for(k = 0; k < n; ++k){
    indexA = i*n + k;
    indexB = k*n + j;

    temp += s_A[indexA] * s_B[indexB];
  }

  d_C[indexC] = temp;
}

The kernel is launched as a single block running on one SM, so the maximum matrix dimension is limited to 32x32 = 1024 threads.
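
For reference, the host-side launch looks roughly like the sketch below (my own addition, assuming n <= 32 and that the device buffers d_A, d_B, d_C are already allocated and filled):

// Hypothetical launch code, not part of the original kernels:
// one block of n x n threads; Kernel_shared additionally gets
// 2*n*n floats of dynamic shared memory for the two input matrices.
dim3 grid(1, 1);
dim3 block(n, n);

Kernel_global<<<grid, block>>>(d_C, d_A, d_B, n);
Kernel_shared<<<grid, block, 2 * n * n * sizeof(float)>>>(d_C, d_A, d_B, n);
cudaDeviceSynchronize();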

I am really surprised that both kernels have equal execution time on a GTX 460. I expected a speed-up due to the coalesced reading of matrix d_A. Can anyone explain this to me, please?

I’ll add to your question: I have a finite-difference kernel that, on compute capability 3.0, runs two times FASTER with NO shared memory than when using shared memory.

My guess is that perhaps the GPU’s L1 and L2 caching is better than our own shared memory management?

This isn’t completely unheard of. There are many possible reasons why this could happen; here’s an example:

Shared memory use limits SM occupancy. This means that if a kernel doesn’t use shared memory, it may be possible to run more warps concurrently on the SM. Sometimes this results in a speedup if the kernel is limited by occupancy, and global memory accesses don’t add too much overhead.
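
To put numbers on that for this thread (my own back-of-the-envelope arithmetic, assuming Fermi’s default 48 KB of shared memory per SM):

  shared memory per block = 2 * 32 * 32 * 4 B = 8 KB    (Kernel_shared, n = 32)
  resident blocks per SM  = 48 KB / 8 KB      = 6        (upper bound from shared memory alone)

Kernel_global has no such cap. In this particular single-block test neither kernel is occupancy-limited anyway, so the effect would only show up with larger launches.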

In Dalibor_CZ’s case, the input matrices should both fit in the L1, and the access pattern should make good use of the cache (A is broadcast to all threads in the warp, and B should align to a cache line).
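
For the sizes in question this is easy to verify (my own arithmetic, assuming n = 32):

  working set = 2 matrices * 32 * 32 * 4 B = 8 KB

which fits comfortably in the 16 KB (or 48 KB, depending on the cache configuration) of L1 on a Fermi SM, so repeated reads of A and B can be served from on-chip cache much as they would be from shared memory.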

For Fermi SM 2.1 (which includes the GF104 in the GTX 460), L1 and shared memory performance are identical. As Gregory mentioned, since the data fits in L1, you basically get the same speed as shared memory, with less programming hassle.
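
If you do rely on L1, it can also help to ask for the larger L1 partition; a minimal sketch (my addition, using the standard runtime call) would be:

// Prefer the 48 KB L1 / 16 KB shared split on Fermi for this kernel.
cudaFuncSetCacheConfig(Kernel_global, cudaFuncCachePreferL1);
Kernel_global<<<grid, block>>>(d_C, d_A, d_B, n);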

For the older Fermi SM 2.0, L1 is empirically about 40% slower than shared memory accesses, so it’s worth manually copying reused data to shared memory if memory bandwidth is a bottleneck.

But, as Gregory also pointed out, using shared memory will change your launch configuration and may prevent you from running as many blocks concurrently.

We swim in an ocean of performance tradeoff options!

Thanks for all replies.

To Gregory:
Just a few simple newbie questions. Data in the L1 cache are stored in lines, right? Is a cache line the equivalent of a bank in shared memory? After all, it is physically the same type of memory. As far as I know, there are 32 banks (4 bytes each) in SM 2.1. Therefore, as you pointed out, B should be aligned so that it causes no "bank conflicts" in the cache (or in shared memory, when it is stored there).

To SPWorley:
I did not know that cache performance in SM 2.0 differs so much from SM 2.1, thanks.

Many of the details of the Fermi L1 are described in the CUDA Programming Guide:

“A cache line is 128 bytes and maps to a 128 byte aligned segment in device memory.
Memory accesses that are cached in both L1 and L2 are serviced with 128-byte memory
transactions whereas memory accesses that are cached in L2 only are serviced with
32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for
example, in the case of scattered memory accesses.
If the size of the words accessed by each thread is more than 4 bytes, a memory
request by a warp is first split into separate 128-byte memory requests that are issued
independently.”

In your example, the accesses to B are aligned to a 128-byte cache line.
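
To spell that out for n = 32 (my own illustration of the access pattern in the inner loop):

  // within one warp, threadIdx.y (= i) is fixed and threadIdx.x (= j) runs from 0 to 31
  d_B[k*n + j]   // 32 consecutive floats = 128 bytes on a 128-byte boundary: one cache line
  d_A[i*n + k]   // the same address for all 32 threads of the warp: a single broadcast

So per loop iteration the plain global-memory kernel generates essentially the same memory traffic as the shared-memory version, which is consistent with the identical timings.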