Why can't cuBLAS & SAXPY reach the theoretical max FLOPS?

Hi all:

I am trying to profile how fast my GPU can go.

I copied a simple SAXPY program but cannot explain the result.

#include <cuda_runtime.h>
#include <stdlib.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}

int main(void)
{
  int N = 20 * (1 << 20);
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));
  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }
  /* copy the inputs to the device */
  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
  /* measure time start */
  saxpy<<<(N+511)/512, 512>>>(N, 2.0f, d_x, d_y);
  cudaDeviceSynchronize();  /* wait for the kernel before stopping the timer */
  /* measure time stop */
  ...
}

Sadly, it reaches only about 0.1% of the max FLOPS…

So I tried to unroll the loop, i.e. do more work within each thread, like this:

__global__
void saxpy(int n, float a, float *x, float *y, float *z)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  int offset = 128 * i;  /* each thread handles 128 consecutive elements */
  if (offset + 128 <= n)
    for (int k = 0; k < 128; k++) {
      z[offset + k] = a * x[offset + k] + y[offset + k];
    }
}

In this example each thread does 128 calculations, instead of just one as in the original.

However, it makes things worse… Note that the number of blocks is reduced accordingly, so the total number of operations stays the same.

Finally, I tried using a temporary variable (or a temporary array) to store the result.
Now it becomes much faster, close to the theoretical max FLOPS.

Even so, I think this way is wrong, since the result is never written back:

tmp = a * x[offset + k] + y[offset + k];

My question is:

  1. Is this a valid way to measure? Why does a temporary variable (or array) make it so much faster?
  2. Why does unrolling the loop make it slower?
  3. I used cuBLAS's cublasSaxpy and the result is also close to 0.1%; does that imply something?

Also, I found that the CUDA sample 0_Simple/matrixMul includes a profiling routine too.
It can reach 1330.88 GFlop/s, compared to ~60 GFLOPS for this even simpler test program. That confuses me…

Thanks in advance!

How much do you get using Thrust's saxpy? See solution #4 at https://devblogs.nvidia.com/six-ways-saxpy/
What does the NVIDIA Visual Profiler say about the runs you tried? It will suggest the bottleneck in each kernel.

You're very much memory-bandwidth limited in your kernel.

EDIT:
Your 128-element stride actually makes things worse: you are no longer coalescing your memory accesses, so each warp requests data from many different cache lines.

A 1D vector SAXPY will not benefit much from CUDA acceleration, as each vector element is touched only once. The real gains come with big matrices, where each element must be accessed many times; the use of shared memory can help a lot there.