MATLAB examples on Linux: speed does not seem to scale with graphics-card capabilities

Dear all,

I am attempting to speed up some of my problems using CUDA. I have downloaded, installed, compiled, and run the examples supplied with MATLAB, and I see up to a 6x speedup in the “speed_fft” example with the following hardware:

damien@damien-desktop:~$ cat /proc/cpuinfo 

processor	: 0

vendor_id	: AuthenticAMD

cpu family	: 16

model		: 2

model name	: AMD Phenom(tm) 9950 Quad-Core Processor

stepping	: 3

cpu MHz		: 1300.000

cache size	: 512 KB

<snip>

and the following graphics cards:

damien@damien-desktop:~/NVIDIA_CUDA_SDK/bin/linux/release$ ./deviceQuery 

There are 3 devices supporting CUDA

Device 0: "GeForce 8600 GT"

  Major revision number:						 1

  Minor revision number:						 1

  Total amount of global memory:				 267714560 bytes

  Number of multiprocessors:					 4

  Number of cores:							   32

  Total amount of constant memory:			   65536 bytes

  Total amount of shared memory per block:	   16384 bytes

  Total number of registers available per block: 8192

  Warp size:									 32

  Maximum number of threads per block:		   512

  Maximum sizes of each dimension of a block:	512 x 512 x 64

  Maximum sizes of each dimension of a grid:	 65535 x 65535 x 1

  Maximum memory pitch:						  262144 bytes

  Texture alignment:							 256 bytes

  Clock rate:									1.19 GHz

  Concurrent copy and execution:				 Yes

Device 1: "GeForce 8400 GS"

  Major revision number:						 1

  Minor revision number:						 1

  Total amount of global memory:				 268173312 bytes

  Number of multiprocessors:					 2

  Number of cores:							   16

  Total amount of constant memory:			   65536 bytes

  Total amount of shared memory per block:	   16384 bytes

  Total number of registers available per block: 8192

  Warp size:									 32

  Maximum number of threads per block:		   512

  Maximum sizes of each dimension of a block:	512 x 512 x 64

  Maximum sizes of each dimension of a grid:	 65535 x 65535 x 1

  Maximum memory pitch:						  262144 bytes

  Texture alignment:							 256 bytes

  Clock rate:									0.92 GHz

  Concurrent copy and execution:				 Yes

Device 2: "nForce 750a SLI"

  Major revision number:						 1

  Minor revision number:						 1

  Total amount of global memory:				 266076160 bytes

  Number of multiprocessors:					 1

  Number of cores:							   8

  Total amount of constant memory:			   65536 bytes

  Total amount of shared memory per block:	   16384 bytes

  Total number of registers available per block: 8192

  Warp size:									 32

  Maximum number of threads per block:		   512

  Maximum sizes of each dimension of a block:	512 x 512 x 64

  Maximum sizes of each dimension of a grid:	 65535 x 65535 x 1

  Maximum memory pitch:						  262144 bytes

  Texture alignment:							 256 bytes

  Clock rate:									1.20 GHz

  Concurrent copy and execution:				 No

Test PASSED

Press ENTER to exit...

It all appears to work fine on Ubuntu 8.10 (I had to downgrade the compiler to use it).

Now I am trying to get the “square_me.cu” example from the CUDA MATLAB white paper working. My interpretation of it is as follows:

#include "cuda.h"

#include "mex.h"

/* Kernel to square elements of the array on the GPU */

__global__ void square_elements(float* in, float* out, int N)

{

	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	if ( idx < N) 

		out[idx] = in[idx] * in[idx];

}

/* Gateway function */

void mexFunction(int nlhs, mxArray *plhs[],

				 int nrhs, const mxArray *prhs[])

{

	int i, j, m, n;

	double *data1 = NULL, *data2 = NULL;

	float *data1f = NULL, *data2f = NULL;

	float *data1f_gpu = NULL, *data2f_gpu = NULL;

	mxClassID category;

	if (nrhs != nlhs)

	{

		mexErrMsgTxt("The number of input and output arguments must be the same.");

	}

	

	// Get the device and its properties

	int device = 2;

	cudaSetDevice(device);

	cudaDeviceProp properties;

	cudaGetDeviceProperties(&properties, device);

	mexPrintf("Using device %d with %d multiprocessors\n", device, properties.multiProcessorCount);   

	

	for (i = 0; i < nrhs; i++)

	{

		/* Find the dimensions of the data */

		m = mxGetM(prhs[i]);

		n = mxGetN(prhs[i]);

		

		/* Create an mxArray for the output data */

		plhs[i] = mxCreateDoubleMatrix(m, n, mxREAL);

		

		/* Create an input and output data array on the GPU*/

		cudaMalloc( (void **) &data1f_gpu, sizeof(float)*m*n);

		cudaMalloc( (void **) &data2f_gpu, sizeof(float)*m*n);

		

		/* Retrieve the input data */

		data1 = mxGetPr(prhs[i]);

		

		/* Check if the input array is single or double precision */

		category = mxGetClassID(prhs[i]);

		if ( category == mxSINGLE_CLASS)

		{

			/* The input array is single precision, it can be sent directly to the card */

			cudaMemcpy( data1f_gpu, (float*) data1, sizeof(float)*m*n, cudaMemcpyHostToDevice);

		}			  

		

		if ( category == mxDOUBLE_CLASS)

		{

			/* The input array is in double precision; it needs to be converted

			   to floats before being sent to the card */

			data1f = (float *) mxMalloc(sizeof(float)*m*n);

			for (j = 0; j < m*n; j++)

			{

				data1f[j] = (float) data1[j];

			}

			cudaMemcpy( data1f_gpu, data1f, sizeof(float)*n*m, cudaMemcpyHostToDevice);

		}

		data2f = (float *) mxMalloc(sizeof(float)*m*n);

			

		/* Compute execution configuration using 128 threads per block */

		dim3 dimBlock(128);

		dim3 dimGrid( (m*n) / dimBlock.x);

		if ( (n*m) % 128 != 0 ) 

			dimGrid.x += 1;

			

		/* Call function on GPU */

		square_elements<<<dimGrid,dimBlock>>>(data1f_gpu, data2f_gpu, n*m);

			

		/* Copy result back to host */

		cudaMemcpy( data2f, data2f_gpu, sizeof(float)*n*m,   cudaMemcpyDeviceToHost);

			

		/* Create a pointer to the output data */

		data2 = mxGetPr(plhs[i]);

			

		/* Convert from single to double before returning */

		for (j = 0; j < m*n; j++)

		{

			data2[j] = (double) data2f[j];

		}

			

		/* Clean-up memory on device and host */

		mxFree(data1f);

		data1f = NULL; /* reset so we don't double-free on a later iteration whose input is single */

		mxFree(data2f);

		cudaFree(data1f_gpu);

		cudaFree(data2f_gpu);

	}

}
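To see where the time is actually going, one option would be to bracket just the kernel launch with CUDA events, which excludes the PCIe transfers and the host-side double/float conversion from the measurement. This is only a sketch to drop into the gateway function above, not part of the white-paper code:

```c
/* Sketch: time only the kernel with the standard CUDA event API.
   Assumes the variables dimGrid, dimBlock, data1f_gpu, data2f_gpu,
   n and m from the gateway function above are in scope. */
cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
square_elements<<<dimGrid, dimBlock>>>(data1f_gpu, data2f_gpu, n*m);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   /* kernel launches are asynchronous */

cudaEventElapsedTime(&ms, start, stop);
mexPrintf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

If the event-measured kernel time is a small fraction of the total elapsed time, the bottleneck is the transfers and conversions around it, not the compute.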

However, it executes significantly more slowly than the native MATLAB version, and is roughly on par with my AMD 3800 box. Here are some examples:

>> a = single(rand(1,10000000));

% Compare with the native matlab implementation:

>> tic; a2 = a.^2; toc

Elapsed time is 0.088954 seconds.

>> tic; a2 = a.^2; toc

Elapsed time is 0.080937 seconds.

>> tic; a2 = a.^2; toc

Elapsed time is 0.081103 seconds.

% Using Card 0:

>> tic; a2 = cuda_square_me(a); toc;

Using device 0 with 4 multiprocessors

Elapsed time is 0.168026 seconds.

>> tic; a2 = cuda_square_me(a); toc;

Using device 0 with 4 multiprocessors

Elapsed time is 0.145517 seconds.

>> tic; a2 = cuda_square_me(a); toc;

Using device 0 with 4 multiprocessors

Elapsed time is 0.154519 seconds.

% Using card 1

>> tic; a2 = cuda_square_me(a); toc;

Using device 1 with 2 multiprocessors

Elapsed time is 0.137157 seconds.

>> tic; a2 = cuda_square_me(a); toc;

Using device 1 with 2 multiprocessors

Elapsed time is 0.136716 seconds.

>> tic; a2 = cuda_square_me(a); toc;

Using device 1 with 2 multiprocessors

Elapsed time is 0.136826 seconds.

% Using Card 2

>> tic; a2 = cuda_square_me(a); toc;

Using device 2 with 1 multiprocessors

Elapsed time is 0.111030 seconds.

>> tic; a2 = cuda_square_me(a); toc;

Using device 2 with 1 multiprocessors

Elapsed time is 0.133819 seconds.

>> tic; a2 = cuda_square_me(a); toc;

Using device 2 with 1 multiprocessors

Elapsed time is 0.123133 seconds.

Most interestingly, the timings for the CUDA implementation do not change depending on which graphics card I use.

This is my first day using CUDA, and I am still getting my head around blocks, threads, etc., but clearly my application is not scaling at all. What am I doing wrong? Is it some sort of memory-bandwidth issue? Or is it simply a two-year-old graphics card against a fairly modern processor?

If I can work out how to scale it all correctly, I’d probably stick two 9800s in this thing tomorrow :)

Cheers

Damien

The kernel you are executing on the GPU will spend most of its time fetching from memory rather than doing calculation, so there probably won’t be much of a speedup between the CPU and the GPU. On top of that, your timings include copying the array across the PCIe bus in both directions, which is the same cost on every card in the system.

Hi all,
I would like to know how to compile this example in MATLAB — what is the command?
I tried this command: " nvmex -f nvopts.sh square_me.cu -output toto.mex -IC:\usr\local\cuda\include -LC:\usr\local\cuda\lib -lcudart "
but it doesn’t work; the error messages are: " mex: no file name given " and " ??? Error using ==> nvmex at 206 Unable to complete successfully ".

Thank you.
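Two things look off in that command: it mixes the Linux options file nvopts.sh with Windows-style C:\ paths, and -output should be given the name without a .mex* extension (mex appends the platform-specific one itself). A guess at a working Linux invocation, assuming CUDA is installed under /usr/local/cuda as in the white paper (untested — adjust paths to your install):

```shell
# Hypothetical invocation; paths are assumptions based on a default
# Linux CUDA install. Note: no extension after -output.
nvmex -f nvopts.sh square_me.cu -output toto \
      -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcudart
```

On Windows the equivalent would use the Windows options file and drive paths instead; the key fix either way is dropping the .mex extension from the -output argument.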