Six Ways to SAXPY

Originally published at: https://developer.nvidia.com/blog/six-ways-saxpy/

This post is a GPU program chrestomathy. What’s a Chrestomathy, you ask? In computer programming, a program chrestomathy is a collection of similar programs written in various programming languages, for the purpose of demonstrating differences in syntax, semantics and idioms for each language. [Wikipedia] There are several good examples of program chrestomathies on the web, including Rosetta Code and NBabel, which demonstrates gravitational N-body simulation in…

Hi Mark,

I have just started looking into CUDA and downloaded and installed Visual Studio on my Windows machine (GPU: Nvidia Geforce 940M CUDA:7.5 MS VS 2013). I have started with vector add and then read about SAXPY.
Where is the definition of the SAXPY code?

cublas_v2.h contains:
#define cublasSaxpy cublasSaxpy_v2

cublas_api.h contains
CUBLASAPI cublasStatus_t CUBLASWINAPI cublasSaxpy_v2 (..);

but where is the actual code where processing occurs?

I am confused as all blogs just use the api call but I wanted to see the actual code. I am new to cpp as well, can you help me with the location of its definition and not declaration in CUDA samples

Hi Deepak, cuBLAS is not an open-source library, so you can't see the implementation code for its routines.

Hi Mark,

Thanks for your reply.Yes I realised that after a while. Actually I am working on a project about kernel fusions especially CUDAs. Wanted to start small like DAXPY and DGEMM. Will try writing the code myself now and use other Cuda samples like BlackScholes(full implementation is present) or Rodinia benchmarks.
Thanks for all the elaborate tutorials.

Deepak.

Hello Mark,

Thank you for providing such a beautiful article on this topic. Recently I have started learning about parallel computing and gpu programming and after reading your article and trying out the code I stumbled over an error:

void saxpy(int n, float a, float *x, float *y)
{
#pragma acc kernels
for (int i = 0; i < n; ++i)
y[i] = a*x[i] + y[i];
}

The following snippet of code produces an error at compilation if *x and *y are dinamically allocated using malloc

The error:

PGC-S-0155-Cannot determine bounds for array x (saxpy.c: 26)
PGC-S-0155-Cannot determine bounds for array y (saxpy.c: 26)
main:
26, Generating copy(x[:],y[:])
PGC/x86-64 Linux 16.5-0: compilation completed with severe errors

Hi Christian, the following code works for me with PGI 16.5:

void saxpy(int n, float a, float *restrict x, float *restrict y)
{
#pragma acc kernels
for (int i = 0; i < n; ++i)
y[i] = a*x[i] + y[i];
}

int main(void)
{
const int n = 1 << 20;
float *x = new float[n];
float *y = new float[n];

for (int i = 0; i < n; ++i) {
x[i] = (float)i / (float)n;
y[i] = (float)i;
}

// Perform SAXPY on 1M elements
saxpy(n, 2.0, x, y);

delete [] x;
delete [] y;
}

Note that I needed to add the "restrict" flags to tell the compiler that there is no aliasing in the function on these pointers. I updated the post text to reflect this. See https://devblogs.nvidia.com...

Compiled with:

pgc++ -acc -ta=tesla -Minfo=accel saxpy_acc.cpp

Compiler output:

saxpy(int, float, float *, float *):
2, Generating copyin(x[:n])
Generating copy(y[:n])
4, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
4, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

Hi there Mark,

Thank you for your reply, it works like a charm.

Inspired by your article, I've done and benchmarked many more ways (currently 24) to SAXPY: https://github.com/bennylp/...

Fun! I can't say SAXPY is really a good computation benchmark. It's bandwidth bound. However, in your comparison you are basically demonstrating which frameworks have more or less overhead.

Thank you!

But anyway, does it mean that the complexity of walking through an array is about O(1)? Yhank you!

on C++ loop [cpu] (src/saxpy_cpu.cpp) you've got 40 ms with Intel i7-6700 CPU @ 3.40GHz but I've got 3.3 ms with Intel i3-2370M CPU @ 2.40GHz https://github.com/Evegen55... Why your code is so slow?

https://uploads.disquscdn.c... Here is an image

I use N=2^26 not 2^20

Change yours to use N=2^26

Thank you! I've got 220 ms :-)

The question is a bit off-topic, but is the Boost library the original source for this syntax wherever it appears? I always wondered if Scala had been inspired by another language/library for this argument syntax as well.

Not sure which syntax you mean. As mentioned in the post, SAXPY comes from BLAS.