CUDA slower than MATLAB... again I can't get the simplest examples to show any speed-up using GPU

Hello,

I have an HP laptop with an Intel® Core™2 Extreme CPU Q9300 @ 2.53 GHz, 8 GB RAM, and a 64-bit Windows 7 operating system. As a GPU, I have a Quadro FX 3700M. I'm using MATLAB release 2009a. All CUDA files were compiled with Visual Studio 2008.

Details:

Driver version: 261.28
CUDA Cores: 128
Core clock: 550 MHz
Shader clock: 1375 MHz
Memory clock: 799 MHz (1598 MHz)
Memory interface: 256-bit
Total available graphics: 4607 MB
Dedicated video memory: 1024 MB GDDR3
System video memory: 0 MB
Shared system memory: 3583 MB
Video BIOS version: 62.92.51.00.05
IRQ: 16
Bus: PCI Express x16

My problem is the following: I implemented a 1-D FFT algorithm as programmed in the following plug-in

The algorithm works all right and the results are correct, but when I compare the performance against MATLAB's native fft I get no advantage! I tried the same with the simplest examples, like square_me.cu (from the white paper) and speed_fft.m (from the plug-in), which are made precisely to show off the advantage of using the GPU.

Am I doomed to get absolutely no advantage, or is there something I can try to fix?

Thanks in advance

For performance analysis, you ought to do a warm-up call before timing the function. The timed call should also be run a number of times and the results averaged.

For example:

% Benchmarking code for GPU
A = gpuFunction(b_gpu);   % warm-up call, not timed

tic
for i = 1 : NUM_RUNS
    A = gpuFunction(b_gpu);
end
time = toc / NUM_RUNS

Thank you for your help, but this does not answer my question. I have implemented a MATLAB script using tic/toc to measure the performance of the square_me function in several versions. You can find the MEX-CUDA code in the MATLAB-CUDA white paper. Nonetheless, in my case native MATLAB, in both double and single precision, is faster than any MEX-CUDA implementation. Here are some files (to run my test script) and a graphical comparison.

test_square_me_cuda.zip (22.6 KB)

Hi,

I have not developed on a Quadro FX 3700M, but I have experience developing on other mobile chips, and my general observation is that the CPU tends to outperform the GPU on laptops. Give your code a try on a desktop system and post your results. It would also be helpful to post your compiler options.

Regards,
Kwyjibo

I also have a desktop HP computer, but when I run the speed_fft.m script (from the NVIDIA plug-in), the CUDA improvement is not so impressive compared to MATLAB. Here is the performance comparison graph.

My desktop has a Pentium® Dual-Core CPU E5200 @ 2.50 GHz, 2.98 GB RAM, and a 32-bit Windows XP operating system. As a GPU, I have a GeForce 9500 GT. I'm using MATLAB release 2009b. All CUDA files were compiled with Visual Studio 2005.

Details:

Driver version: 185.85

CUDA Cores: 32

Core clock: 550 MHz

Shader clock: 1350 MHz

Memory clock: 400 MHz (800 MHz)

Memory interface: 128-bit

Dedicated video memory: 1024 MB

Video BIOS version: 62.94.46.00.7A

IRQ: 16

Bus: PCI Express x16 Gen2

Thank you in advance

Kwyjibo,

I'm afraid you need to know a little more about how CUDA works. Let's look at your code to implement a square function on the GPU:

/* Kernel to square elements of the array on the GPU */
__global__ void square_elements(float* in, float* out, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N) out[idx] = in[idx]*in[idx];
}

For every element of out, you are doing one multiply operation and two (possibly one) memory fetches. The following code may work faster:

/* Kernel to square elements of the array on the GPU */
__global__ void square_elements(float* in, float* out, int N)
{
    const int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N)
    {
        const float x = in[idx];
        out[idx] = x * x;
    }
}

Here we ensure there is only ever one memory fetch. If indeed there were two fetches before, this will double the speed of your kernel, because your kernel is memory bound.

Next, look at how you are calling it. You are using 128 threads per block. On modern hardware this is just far too few. Increase this to 192 or 256 and you'll again see a big speed-up, because more threads are allocated to each CUDA SM.
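As a sketch, the launch configuration change looks like this (in_d, out_d and N are placeholders for whatever your host code already sets up):

```cuda
/* Hypothetical launch of square_elements with 256 threads per block
   instead of 128. Rounding the grid size up covers the case where N
   is not a multiple of the block size; the idx < N guard in the
   kernel handles the leftover threads. */
const int threadsPerBlock = 256;
const int blocksPerGrid   = (N + threadsPerBlock - 1) / threadsPerBlock;

square_elements<<<blocksPerGrid, threadsPerBlock>>>(in_d, out_d, N);
```

Nothing inside the kernel changes; only the <<< >>> execution configuration does.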

Those are the two easy fixes; now for the slightly harder one: use zero-copy memory. Every source element is read once and every destination element is written once, which is ideal for zero-copy memory. Have a look in the NVIDIA programming guide for exactly how to do this. Basically, you use cudaHostAlloc (with the cudaHostAllocMapped flag) instead of malloc on the host side. You then call cudaHostGetDevicePointer to convert the host pointer into a device pointer you can pass to a GPU kernel. Immediately after the kernel call, call cudaThreadSynchronize to wait for the kernel to complete. Finally, instead of the standard C free function, you need to call cudaFreeHost to release the memory allocated with cudaHostAlloc. Also, at some point during initialisation, you need to call cudaSetDeviceFlags(cudaDeviceMapHost) to tell the CUDA runtime to map host memory into GPU memory space. Then remove any cudaMemcpy calls, as you no longer need them.
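Putting those steps together, a minimal host-side sketch might look like the following (error checking omitted for brevity; square_elements is the kernel from earlier in this thread, and N is an arbitrary example size):

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const int N = 1 << 20;

    /* Must be called before any CUDA allocation, so the runtime
       will map host memory into the GPU address space. */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *in_h, *out_h;   /* pinned, mapped host pointers      */
    float *in_d, *out_d;   /* device views of the same memory   */

    cudaHostAlloc((void**)&in_h,  N * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void**)&out_h, N * sizeof(float), cudaHostAllocMapped);

    /* Convert the host pointers into device pointers. */
    cudaHostGetDevicePointer((void**)&in_d,  in_h,  0);
    cudaHostGetDevicePointer((void**)&out_d, out_h, 0);

    for (int i = 0; i < N; ++i) in_h[i] = (float)i;

    /* No cudaMemcpy needed: the kernel reads and writes host
       memory directly over the PCI Express bus. */
    square_elements<<<(N + 255) / 256, 256>>>(in_d, out_d, N);
    cudaThreadSynchronize();   /* wait for the kernel to finish */

    /* out_h now holds the squared values. */

    cudaFreeHost(in_h);
    cudaFreeHost(out_h);
    return 0;
}
```

Note that cudaThreadSynchronize is the era-appropriate call here; later CUDA releases rename it cudaDeviceSynchronize.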

Get this right and you will double the speed of your program. However, do not allocate more memory than you have physically on the host.

Kind Regards,

Shane Cook.

www.cudadeveloper.com