GPU architecture changes: one kernel per arch?

Hi,

I went back to a two-year-old app which had some kernels “optimized” using shared memory on a GeForce 8800 GTX. At the time the kernels were written, we got a 13x speedup from naive to shared. With the GT200 (Tesla C1060) we still got a 3.3x speedup, but I just realized today that on a Fermi board (Tesla C2060), thanks to the cache, the naive kernel is actually faster than the “optimized” one!

Has anybody had similar results in their apps? Are we heading toward the end of shared memory?

And finally, do we have to write more than one kernel and dynamically switch at runtime based on GPU detection?

Here are the results I get:

GeForce 8800 GTX:
  Time naive  5.60 ms
  Time shared 0.41 ms
  Speedup is 13.65

C1060:
  Time naive  1.38 ms
  Time shared 0.42 ms
  Speedup is 3.32

C2060:
  Time naive  0.12 ms
  Time shared 0.20 ms
  Speedup is 0.61

And the sample test code that produced the results:

#include <string.h>
#include <stdlib.h>
#include <math.h>
#include <stdio.h>
#include <sys/time.h>

#define NP 128
#define LBOX 6.f
#define DX ((float)LBOX/NP)
#define ITERATIONS 10

//========================================================================

// Central difference using a shared-memory tile: each block stages one
// row of the potential into shared memory, then differentiates it.
__global__ void k_shared( float *pot, float *fx )
{
  int tx = threadIdx.x;
  int bx = blockIdx.x;
  int by = blockIdx.y;

  __shared__ float spot[NP];

  int blockCoord = bx * NP * NP + by * NP;
  int cellCoord  = blockCoord + tx;

  spot[tx] = pot[cellCoord];
  __syncthreads();

  // Periodic wrap within the row via the NP-1 mask (NP is a power of two).
  float x1 = spot[ ((unsigned int)(tx - 1)) & (NP - 1) ];
  float x2 = spot[ (tx + 1) & (NP - 1) ];

  fx[cellCoord] = (x2 - x1) / (2.f * DX);
}

// Same computation, reading the neighbours straight from global memory.
__global__ void k_naive( float *pot, float *fx )
{
  int tx = threadIdx.x;
  int bx = blockIdx.x;
  int by = blockIdx.y;

  int cellCoord = bx * NP * NP + by * NP + tx;

  fx[cellCoord] = (pot[(cellCoord + 1) & (NP - 1)] - pot[(cellCoord - 1) & (NP - 1)]) / (2.f * DX);
}

int main(int argc, char **argv)
{
  float *pot, *fx;
  float elapsedTime_naive  = -1;
  float elapsedTime_shared = -1;
  cudaEvent_t start_event, stop_event;

  dim3 dimGridForce(NP, NP);
  dim3 dimBlockForce(NP);

  /* Alloc */
  cudaMalloc((void**)&pot, sizeof(float) * NP * NP * NP);
  cudaMalloc((void**)&fx,  sizeof(float) * NP * NP * NP);

  /* Prepare events for timing */
  cudaEventCreate(&start_event);
  cudaEventCreate(&stop_event);

  /*********************
   * NAIVE
   */
  // Warm-up
  k_naive<<<dimGridForce, dimBlockForce>>>(pot, fx);

  // Bench
  cudaEventRecord(start_event, 0);
  for (int i = 0; i < ITERATIONS; i++) {
    k_naive<<<dimGridForce, dimBlockForce>>>(pot, fx);
  }
  cudaEventRecord(stop_event, 0);
  cudaEventSynchronize(stop_event);
  cudaEventElapsedTime(&elapsedTime_naive, start_event, stop_event);

  /*********************
   * SHARED
   */
  // Warm-up
  k_shared<<<dimGridForce, dimBlockForce>>>(pot, fx);

  // Bench
  cudaEventRecord(start_event, 0);
  for (int i = 0; i < ITERATIONS; i++) {
    k_shared<<<dimGridForce, dimBlockForce>>>(pot, fx);
  }
  cudaEventRecord(stop_event, 0);
  cudaEventSynchronize(stop_event);
  cudaEventElapsedTime(&elapsedTime_shared, start_event, stop_event);

  printf("Time naive  %0.2fms\nTime shared %0.2fms\nSpeedup is %0.2f\n",
         elapsedTime_naive / ITERATIONS, elapsedTime_shared / ITERATIONS,
         elapsedTime_naive / elapsedTime_shared);

  cudaEventDestroy(start_event);
  cudaEventDestroy(stop_event);
  cudaFree(pot);
  cudaFree(fx);
  return 0;
}

Did you play with the register count and block size on Fermi? Actually, since you only use shared memory once, it is not such a big surprise: the caches help with uncoalesced accesses. You could also use textures.
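For what it's worth, here is a rough sketch of what the texture path could look like with the Fermi-era texture reference API; texPot and k_texture are just names made up for this example, and it mirrors the naive kernel's indexing:

// Hypothetical texture-path variant of the naive kernel.
texture<float, 1, cudaReadModeElementType> texPot;

__global__ void k_texture( float *fx )
{
  int tx = threadIdx.x;
  int bx = blockIdx.x;
  int by = blockIdx.y;
  int cellCoord = bx * NP * NP + by * NP + tx;

  // Reads go through the texture cache instead of plain global loads.
  float x1 = tex1Dfetch(texPot, (cellCoord - 1) & (NP - 1));
  float x2 = tex1Dfetch(texPot, (cellCoord + 1) & (NP - 1));
  fx[cellCoord] = (x2 - x1) / (2.f * DX);
}

// Host side: bind the existing device buffer before launching.
//   cudaBindTexture(NULL, texPot, pot, sizeof(float) * NP * NP * NP);
//   k_texture<<<dimGridForce, dimBlockForce>>>(fx);
//   cudaUnbindTexture(texPot);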

I don’t have any good suggestions for your other questions, but wanted to comment on this: shared memory is useful to compensate for limited memory controllers (as in the 8800 GTX) and for GPUs without global memory caching (like the C1060), but it has other uses as well. The best way for threads in a block to exchange data is still via shared memory, so I hope it doesn’t go away anytime soon. :)

Note that architecture-specific optimizations can conveniently be handled using the --generate-code command line parameter and the __CUDA_ARCH__ macro:

__global__ void k_unified( float *pot, float *fx )
{
  unsigned int tx = threadIdx.x;
  unsigned int bx = blockIdx.x;
  unsigned int by = blockIdx.y;

  unsigned int blockCoord = bx * NP * NP + by * NP;
  unsigned int cellCoord  = blockCoord + tx;

#if __CUDA_ARCH__ < 200
  // Pre-Fermi path: stage the row through shared memory.
  __shared__ float spot[NP];

  spot[tx] = pot[cellCoord];
  __syncthreads();

  float x1 = spot[ (tx - 1) & (NP - 1) ];
  float x2 = spot[ (tx + 1) & (NP - 1) ];
#else
  // Fermi and later: read the neighbours directly and let the cache do the work.
  float x1 = pot[(cellCoord - 1) & (NP - 1)];
  float x2 = pot[(cellCoord + 1) & (NP - 1)];
#endif

  fx[cellCoord] = (x2 - x1) / (2.f * DX);
}

Compile with nvcc --generate-code=arch=compute_10,code=sm_10 --generate-code=arch=compute_13,code=sm_13 --generate-code=arch=compute_20,code=sm_20 test.cu and voilà, you have optimized kernels for each of your GPUs, with runtime selection, at little additional effort over a single kernel.

seibert: of course I was being a little provocative when suggesting that we would totally get rid of shared memory; there are certainly a lot of kernels that can benefit from it. Anyway, my situation is quite awkward in terms of backward compatibility. I wonder if all the CUDA SDK examples have been rewritten for Fermi?

tera: thanks for the hint on conditional compilation :-)

The already-mentioned conditional compilation is handy. I have also run into a few cases where the Fermi-optimized computation differs enough from the old version (i.e. a different number of kernels needs to be called) that it makes sense to simply write two versions of the computation and choose which to use at runtime with information from cudaGetDeviceProperties. This does bloat the code a bit and increases maintenance headaches, but those are worth it if you get large performance boosts.
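For example, a minimal sketch of that runtime switch, using the kernel names from the test code above; treating compute capability major >= 2 as "Fermi or later" is the only assumption:

// Pick a kernel at runtime based on the compute capability of the current device.
cudaDeviceProp prop;
int dev;
cudaGetDevice(&dev);
cudaGetDeviceProperties(&prop, dev);

if (prop.major >= 2) {
    k_naive<<<dimGridForce, dimBlockForce>>>(pot, fx);   // cache-friendly path
} else {
    k_shared<<<dimGridForce, dimBlockForce>>>(pot, fx);  // shared-memory path
}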

I have found that the cache can replace shared (or texture) memory in many cases. For example, with the cache there is now no need to manually cache often-used short arrays in shared memory: much simpler code that simply reads through the global memory pointer will perform just as fast.

Another case is that shared memory is no longer needed to coalesce reads of odd-sized objects (like float3). The cache does that automatically.
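A quick illustration (just a sketch, the kernel name is made up): the straightforward version below reads float3 elements directly and relies on L1/L2, where pre-Fermi code would typically have staged the loads through shared memory as scalar floats.

// Straightforward float3 read; on Fermi the caches absorb the 12-byte
// accesses that used to require shared-memory staging for coalescing.
__global__ void sum_float3(const float3 *in, float *out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float3 v = in[i];
    out[i] = v.x + v.y + v.z;
  }
}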

As for other cases? It would be an interesting experiment to compare some of the highly tuned convolution kernels with and without shared memory - I wonder how well the cache would perform there?

The big thing to keep in mind when heavily using the cache is to switch the cache setting so you get 48 KB of L1 and 16 KB of shared memory, instead of the default split, which favours shared memory.
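In code that is a one-liner per kernel, e.g. for the naive kernel from the test code above:

// Request the 48 KB L1 / 16 KB shared split for this kernel
// (the default on Fermi favours shared memory).
cudaFuncSetCacheConfig(k_naive, cudaFuncCachePreferL1);
k_naive<<<dimGridForce, dimBlockForce>>>(pot, fx);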

Obviously, one still needs shared memory for reductions and the like, but the L1 cache can certainly replace it in many circumstances.
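For completeness, this is the kind of intra-block cooperation where shared memory remains the natural tool; a minimal block-sum reduction sketch (kernel name made up, block size assumed to be a power of two no larger than NP):

// Each block reduces blockDim.x elements of `in` into one element of `out`.
__global__ void block_sum(const float *in, float *out)
{
  __shared__ float sdata[NP];
  int tx = threadIdx.x;

  sdata[tx] = in[blockIdx.x * blockDim.x + tx];
  __syncthreads();

  // Tree reduction in shared memory.
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tx < s)
      sdata[tx] += sdata[tx + s];
    __syncthreads();
  }

  if (tx == 0)
    out[blockIdx.x] = sdata[0];
}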