When porting the machine learning framework I use to CUDA, I was disappointed to find that, for the kind of operations I do, CUDA is actually slower than the CPU code. Most of my operations are matrix-vector multiplications, with sizes on the order of hundreds (e.g. 500x100). To see from which size onward CUBLAS sgemv beats CBLAS sgemv, I wrote this small benchmark:

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#include <cutil.h>
#include <cublas.h>
#include <mkl_cblas.h>

int main(int argc, char** argv)
{
    int nbIter = 10000;
    int m;
    int n = 128;

    for (int j = 0; j < 10; ++j) {
        m = 16 << j;
        // n = m;
        printf("-------------\nEvaluating %i iterations for a matrix %ix%i\n", nbIter, m, n);

        float time;
        float *mat, *x, *y;
        float *data = (float*) malloc(sizeof(float) * m * n);
        for (int i = 0; i < m * n; ++i)
            data[i] = ((float)i) / ((float)(m * n));

        unsigned int timer = 0;

        // cuda test
        CUT_SAFE_CALL( cutCreateTimer( &timer) );
        CUDA_SAFE_CALL( cudaMalloc((void**) &mat, m * n * sizeof(float)) );
        CUDA_SAFE_CALL( cudaMalloc((void**) &x, n * sizeof(float)) );
        CUDA_SAFE_CALL( cudaMalloc((void**) &y, m * sizeof(float)) );
        CUDA_SAFE_CALL( cudaMemcpy( mat, data, m * n * sizeof(float), cudaMemcpyHostToDevice) );
        CUDA_SAFE_CALL( cudaMemcpy( x, data, n * sizeof(float), cudaMemcpyHostToDevice) );
        CUDA_SAFE_CALL( cudaMemcpy( y, data, m * sizeof(float), cudaMemcpyHostToDevice) );

        CUT_SAFE_CALL( cutStartTimer( timer) );
        for (int i = 0; i < nbIter; ++i)
        {
            cublasSgemv('t', n, m, 1, mat, n, x, 1, 1, y, 1);
        }
        CUDA_SAFE_CALL( cudaThreadSynchronize() );
        CUT_SAFE_CALL( cutStopTimer( timer) );
        time = cutGetTimerValue( timer);

        // output results
        printf("CUDA Time: %f (ms)\n", time);

        CUDA_SAFE_CALL( cudaFree(mat) );
        CUDA_SAFE_CALL( cudaFree(x) );
        CUDA_SAFE_CALL( cudaFree(y) );
        CUT_SAFE_CALL( cutDeleteTimer( timer) );

        // cpu test
        mat = (float*) malloc(m * n * sizeof(float));
        x = (float*) malloc(n * sizeof(float));
        y = (float*) malloc(m * sizeof(float));
        memcpy(mat, data, m * n * sizeof(float));
        memcpy(x, data, n * sizeof(float));
        memcpy(y, data, m * sizeof(float));

        clock_t start = clock();
        for (int i = 0; i < nbIter; ++i)
        {
            cblas_sgemv(CblasColMajor, CblasTrans, n, m, 1, mat, n, x, 1, 1, y, 1);
        }
        printf("CPU Time: %f (ms)\n", (clock() - start) * 1000 / (float) CLOCKS_PER_SEC);

        free(mat);
        free(x);
        free(y);
        free(data);
    }

    return 0;
}[/codebox]

The second dimension is fixed because that is usually what I have in my experiments. Here are the results (note that the CPU timer is far less accurate than the GPU one):

[codebox]10000 iterations per matrix size

Matrix size    CUDA Time (ms)    CPU Time (ms)
  16x128          214.681000        10.000000
  32x128          278.380005        10.000000
  64x128          278.065002        20.000000
 128x128          277.746002        30.000000
 256x128          278.177002        70.000000
 512x128          279.446991       140.000000
1024x128          289.652008       310.000000
2048x128          374.023987       630.000000
4096x128          680.843018      1290.000000
8192x128         1254.005005      2590.000244[/codebox]

I also ran the same test for square matrices:

[codebox]10000 iterations per matrix size

Matrix size    CUDA Time (ms)    CPU Time (ms)
  16x16            89.642998         10.000000
  32x32           107.869003          0.000000
  64x64           164.585999         20.000000
 128x128          277.773987         30.000000
 256x256          506.329987        120.000000
 512x512         1154.552002        530.000000
1024x1024        3484.691895       1960.000000
2048x2048        7111.210938      17180.000000
4096x4096       21080.605469      69410.000000
8192x8192       80645.937500     308120.000000[/codebox]

It seems that CUDA only starts to be interesting once your sizes go above a thousand (maybe even 2048). Do you have similar results? Is my way of benchmarking this valid? I know the CPU timer is not accurate at all, but I don't need very precise measurements, just the order of magnitude (is it 10x slower or 3x faster?).

My hardware is an Intel Core i7 920 (4x2.67 GHz, Hyper-Threading, 8 MB L3), 8 GB of DDR3-1600, and 2x GTX 275 (though the above code obviously only uses one). I use CUDA 2.3 and Intel MKL 9.0.

I would appreciate it if you could run the above code on your setup and post your results here, along with your hardware specs. Thanks!