Dear all,
I am attempting to speed up some problems I have using CUDA. I have downloaded, installed, compiled, and run the examples supplied with MATLAB, and the “speed_fft” example reports up to a 6x speedup on the following hardware:
damien@damien-desktop:~$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 2
model name : AMD Phenom(tm) 9950 Quad-Core Processor
stepping : 3
cpu MHz : 1300.000
cache size : 512 KB
<snip>
with the following graphics cards:
damien@damien-desktop:~/NVIDIA_CUDA_SDK/bin/linux/release$ ./deviceQuery
There are 3 devices supporting CUDA
Device 0: "GeForce 8600 GT"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 267714560 bytes
Number of multiprocessors: 4
Number of cores: 32
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.19 GHz
Concurrent copy and execution: Yes
Device 1: "GeForce 8400 GS"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 268173312 bytes
Number of multiprocessors: 2
Number of cores: 16
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 0.92 GHz
Concurrent copy and execution: Yes
Device 2: "nForce 750a SLI"
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 266076160 bytes
Number of multiprocessors: 1
Number of cores: 8
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.20 GHz
Concurrent copy and execution: No
Test PASSED
Press ENTER to exit...
It all appears to be working fine in Ubuntu 8.10 (I had to downgrade the compiler to use it).
Now, I am trying to get the “square_me.cu” example working as described in the CUDA MATLAB white paper. My interpretation of it is as follows:
#include "cuda.h"
#include "mex.h"

/* Kernel to square elements of the array on the GPU */
__global__ void square_elements(float* in, float* out, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        out[idx] = in[idx] * in[idx];
}

/* Gateway function */
void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
    int i, j, m, n;
    double *data1 = NULL, *data2 = NULL;
    float *data1f = NULL, *data2f = NULL;
    float *data1f_gpu = NULL, *data2f_gpu = NULL;
    mxClassID category;

    if (nrhs != nlhs)
    {
        mexErrMsgTxt("The number of input and output arguments must be the same.");
    }

    /* Get the device and its properties */
    int device = 2;
    cudaSetDevice(device);
    cudaDeviceProp properties;
    cudaGetDeviceProperties(&properties, device);
    mexPrintf("Using device %d with %d multiprocessors\n", device, properties.multiProcessorCount);

    for (i = 0; i < nrhs; i++)
    {
        /* Find the dimensions of the data */
        m = mxGetM(prhs[i]);
        n = mxGetN(prhs[i]);

        /* Create an mxArray for the output data */
        plhs[i] = mxCreateDoubleMatrix(m, n, mxREAL);

        /* Create an input and output data array on the GPU */
        cudaMalloc( (void **) &data1f_gpu, sizeof(float)*m*n);
        cudaMalloc( (void **) &data2f_gpu, sizeof(float)*m*n);

        /* Retrieve the input data (mxGetData works for any numeric class) */
        data1 = (double *) mxGetData(prhs[i]);

        /* Check if the input array is single or double precision */
        category = mxGetClassID(prhs[i]);
        if ( category == mxSINGLE_CLASS)
        {
            /* The input array is single precision; it can be sent directly to the card */
            cudaMemcpy( data1f_gpu, (float*) data1, sizeof(float)*m*n, cudaMemcpyHostToDevice);
        }
        if ( category == mxDOUBLE_CLASS)
        {
            /* The input array is double precision; it needs to be converted to
               floats before being sent to the card */
            data1f = (float *) mxMalloc(sizeof(float)*m*n);
            for (j = 0; j < m*n; j++)
            {
                data1f[j] = (float) data1[j];
            }
            cudaMemcpy( data1f_gpu, data1f, sizeof(float)*n*m, cudaMemcpyHostToDevice);
        }

        data2f = (float *) mxMalloc(sizeof(float)*m*n);

        /* Compute execution configuration using 128 threads per block */
        dim3 dimBlock(128);
        dim3 dimGrid( (m*n) / dimBlock.x);
        if ( (n*m) % 128 != 0 )
            dimGrid.x += 1;

        /* Call function on GPU */
        square_elements<<<dimGrid,dimBlock>>>(data1f_gpu, data2f_gpu, n*m);

        /* Copy result back to host */
        cudaMemcpy( data2f, data2f_gpu, sizeof(float)*n*m, cudaMemcpyDeviceToHost);

        /* Create a pointer to the output data */
        data2 = mxGetPr(plhs[i]);

        /* Convert from single to double before returning */
        for (j = 0; j < m*n; j++)
        {
            data2[j] = (double) data2f[j];
        }

        /* Clean up memory on device and host; data1f is only allocated for
           double-precision inputs, so guard the free and reset the pointer
           to avoid a double free on later iterations */
        if (data1f != NULL)
        {
            mxFree(data1f);
            data1f = NULL;
        }
        mxFree(data2f);
        cudaFree(data1f_gpu);
        cudaFree(data2f_gpu);
    }
}
However, its execution time is significantly slower than the native MATLAB implementation, and roughly on par with my AMD 3800 box. Here are some examples:
>> a = single(rand(1,10000000));
% Compare with the native matlab implementation:
>> tic; a2 = a.^2; toc
Elapsed time is 0.088954 seconds.
>> tic; a2 = a.^2; toc
Elapsed time is 0.080937 seconds.
>> tic; a2 = a.^2; toc
Elapsed time is 0.081103 seconds.
% Using Card 0:
>> tic; a2 = cuda_square_me(a); toc;
Using device 0 with 4 multiprocessors
Elapsed time is 0.168026 seconds.
>> tic; a2 = cuda_square_me(a); toc;
Using device 0 with 4 multiprocessors
Elapsed time is 0.145517 seconds.
>> tic; a2 = cuda_square_me(a); toc;
Using device 0 with 4 multiprocessors
Elapsed time is 0.154519 seconds.
% Using card 1
>> tic; a2 = cuda_square_me(a); toc;
Using device 1 with 2 multiprocessors
Elapsed time is 0.137157 seconds.
>> tic; a2 = cuda_square_me(a); toc;
Using device 1 with 2 multiprocessors
Elapsed time is 0.136716 seconds.
>> tic; a2 = cuda_square_me(a); toc;
Using device 1 with 2 multiprocessors
Elapsed time is 0.136826 seconds.
% Using Card 2
>> tic; a2 = cuda_square_me(a); toc;
Using device 2 with 1 multiprocessors
Elapsed time is 0.111030 seconds.
>> tic; a2 = cuda_square_me(a); toc;
Using device 2 with 1 multiprocessors
Elapsed time is 0.133819 seconds.
>> tic; a2 = cuda_square_me(a); toc;
Using device 2 with 1 multiprocessors
Elapsed time is 0.123133 seconds.
Most interestingly, the timings for the CUDA implementation DO NOT change depending on which graphics card I am using.
This is my first day using CUDA, and I am still getting my head around blocks, threads, etc., but clearly my application is not scaling at all. What am I doing wrong? Is it likely some sort of memory bandwidth issue? Or is it simply a two-year-old graphics card against a fairly modern processor?
If I can work out how to scale it all correctly, I’d probably stick two 9800s in this thing tomorrow :)
Cheers
Damien