Dear all,

I am attempting to speed up some problems I have using CUDA. I have downloaded, installed, compiled and run the examples supplied for MATLAB, which report up to a 6x speedup in the “speed_fft” example with the following hardware:

```
damien@damien-desktop:~$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 2
model name : AMD Phenom(tm) 9950 Quad-Core Processor
stepping : 3
cpu MHz : 1300.000
cache size : 512 KB
<snip>
```

With the following graphics cards:

```
damien@damien-desktop:~/NVIDIA_CUDA_SDK/bin/linux/release$ ./deviceQuery
There are 3 devices supporting CUDA
Device 0: "GeForce 8600 GT"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 267714560 bytes
Number of multiprocessors: 4
Number of cores: 32
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.19 GHz
Concurrent copy and execution: Yes
Device 1: "GeForce 8400 GS"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 268173312 bytes
Number of multiprocessors: 2
Number of cores: 16
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 0.92 GHz
Concurrent copy and execution: Yes
Device 2: "nForce 750a SLI"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 266076160 bytes
Number of multiprocessors: 1
Number of cores: 8
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.20 GHz
Concurrent copy and execution: No
Test PASSED
Press ENTER to exit...
```

It all appears to be working fine on Ubuntu 8.10 (I had to downgrade the compiler to use it).

Now, I am trying to get the “square_me.cu” example working as described in the CUDA MATLAB white paper. My interpretation of it is as follows:

```
#include "cuda.h"
#include "mex.h"

/* Kernel to square elements of the array on the GPU */
__global__ void square_elements(float* in, float* out, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        out[idx] = in[idx] * in[idx];
}

/* Gateway function */
void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
    int i, j, m, n;
    double *data1 = NULL, *data2 = NULL;
    float *data1f = NULL, *data2f = NULL;
    float *data1f_gpu = NULL, *data2f_gpu = NULL;
    mxClassID category;

    if (nrhs != nlhs)
    {
        mexErrMsgTxt("The number of input and output arguments must be the same.");
    }

    /* Get the device and its properties */
    int device = 2;
    cudaSetDevice(device);
    cudaDeviceProp properties;
    cudaGetDeviceProperties(&properties, device);
    mexPrintf("Using device %d with %d multiprocessors\n", device, properties.multiProcessorCount);

    for (i = 0; i < nrhs; i++)
    {
        /* Find the dimensions of the data */
        m = mxGetM(prhs[i]);
        n = mxGetN(prhs[i]);

        /* Create an mxArray for the output data */
        plhs[i] = mxCreateDoubleMatrix(m, n, mxREAL);

        /* Create an input and output data array on the GPU */
        cudaMalloc((void **) &data1f_gpu, sizeof(float)*m*n);
        cudaMalloc((void **) &data2f_gpu, sizeof(float)*m*n);

        /* Retrieve the input data */
        data1 = mxGetPr(prhs[i]);

        /* Check if the input array is single or double precision */
        category = mxGetClassID(prhs[i]);
        if (category == mxSINGLE_CLASS)
        {
            /* The input array is single precision, it can be sent directly to the card */
            cudaMemcpy(data1f_gpu, (float*) data1, sizeof(float)*m*n, cudaMemcpyHostToDevice);
        }
        if (category == mxDOUBLE_CLASS)
        {
            /* The input array is in double precision; it needs to be converted
               to floats before being sent to the card */
            data1f = (float *) mxMalloc(sizeof(float)*m*n);
            for (j = 0; j < m*n; j++)
            {
                data1f[j] = (float) data1[j];
            }
            cudaMemcpy(data1f_gpu, data1f, sizeof(float)*n*m, cudaMemcpyHostToDevice);
        }
        data2f = (float *) mxMalloc(sizeof(float)*m*n);

        /* Compute execution configuration using 128 threads per block */
        dim3 dimBlock(128);
        dim3 dimGrid((m*n) / dimBlock.x);
        if ((n*m) % dimBlock.x != 0)
            dimGrid.x += 1;

        /* Call function on GPU */
        square_elements<<<dimGrid,dimBlock>>>(data1f_gpu, data2f_gpu, n*m);

        /* Copy result back to host */
        cudaMemcpy(data2f, data2f_gpu, sizeof(float)*n*m, cudaMemcpyDeviceToHost);

        /* Create a pointer to the output data */
        data2 = mxGetPr(plhs[i]);

        /* Convert from single to double before returning */
        for (j = 0; j < m*n; j++)
        {
            data2[j] = (double) data2f[j];
        }

        /* Clean up memory on device and host. The host pointers are reset
           so that a later iteration never frees a stale pointer. */
        mxFree(data1f);
        data1f = NULL;
        mxFree(data2f);
        data2f = NULL;
        cudaFree(data1f_gpu);
        cudaFree(data2f_gpu);
    }
}
```

However, it is significantly slower than native MATLAB execution on this machine, and roughly on par with my AMD 3800 box. Here are some examples:

```
>> a = single(rand(1,10000000));
% Compare with the native matlab implementation:
>> tic; a2 = a.^2; toc
Elapsed time is 0.088954 seconds.
>> tic; a2 = a.^2; toc
Elapsed time is 0.080937 seconds.
>> tic; a2 = a.^2; toc
Elapsed time is 0.081103 seconds.
% Using Card 0:
>> tic; a2 = cuda_square_me(a); toc;
Using device 0 with 4 multiprocessors
Elapsed time is 0.168026 seconds.
>> tic; a2 = cuda_square_me(a); toc;
Using device 0 with 4 multiprocessors
Elapsed time is 0.145517 seconds.
>> tic; a2 = cuda_square_me(a); toc;
Using device 0 with 4 multiprocessors
Elapsed time is 0.154519 seconds.
% Using card 1
>> tic; a2 = cuda_square_me(a); toc;
Using device 1 with 2 multiprocessors
Elapsed time is 0.137157 seconds.
>> tic; a2 = cuda_square_me(a); toc;
Using device 1 with 2 multiprocessors
Elapsed time is 0.136716 seconds.
>> tic; a2 = cuda_square_me(a); toc;
Using device 1 with 2 multiprocessors
Elapsed time is 0.136826 seconds.
% Using Card 2
>> tic; a2 = cuda_square_me(a); toc;
Using device 2 with 1 multiprocessors
Elapsed time is 0.111030 seconds.
>> tic; a2 = cuda_square_me(a); toc;
Using device 2 with 1 multiprocessors
Elapsed time is 0.133819 seconds.
>> tic; a2 = cuda_square_me(a); toc;
Using device 2 with 1 multiprocessors
Elapsed time is 0.123133 seconds.
```

Most interestingly, the timings for the CUDA implementation DO NOT change depending on which graphics card I am using.

This is my first day using CUDA, and I am still getting my head around blocks, threads etc., but clearly my application is not scaling at all. What am I doing wrong? Is it likely to be some sort of memory bandwidth issue? Or is it simply a 2-year-old graphics card up against a fairly modern processor?

If I can work out how to scale it all correctly, I’d probably stick two 9800’s in this thing tomorrow :)

Cheers

Damien