CUDA slower than MATLAB... again I can't get the simplest examples to show any speed-up using GPU

Hello,

I have an HP laptop with an Intel® Core™2 Extreme CPU Q9300 @ 2.53 GHz, 8 GB RAM, and a 64-bit Windows 7 operating system. As a GPU, I have a Quadro FX 3700M. I'm using MATLAB release 2009a. All CUDA files were compiled with Visual Studio 2008.

Details:

Driver version: 261.28
CUDA Cores: 128
Core clock: 550 MHz
Shader clock: 1375 MHz
Memory clock: 799 MHz (1598 MHz)
Memory interface: 256-bit
Total available graphics: 4607 MB
Dedicated video memory: 1024 MB GDDR3
System video memory: 0 MB
Shared system memory: 3583 MB
Video BIOS version: 62.92.51.00.05
IRQ: 16
Bus: PCI Express x16

My problem is the following: I implemented a 1-D FFT algorithm as programmed in the following plug-in

The algorithm works all right and the results are correct, but when I compare the performance against MATLAB's native fft I get no advantage! I tried the same with the simplest examples, like square_me.cu (from the white paper) and speed_fft.m (from the plug-in), which are made precisely to show off the advantage of using the GPU.

Am I doomed to get absolutely no advantage, or is there something I can try to fix?

Thanks in advance

For performance analysis, you ought to do a warm-up call before timing the function. The timed call should also be run a number of times and the results averaged.

For example:

% Benchmarking code for GPU
A = gpuFunction(b_gpu);   % warm-up call, not timed

tic
for i = 1 : NUM_RUNS
    A = gpuFunction(b_gpu);
end
time = toc / NUM_RUNS

Thank you for your help, but this does not answer my question. I have implemented a MATLAB script using tic/toc to measure the performance of the square_me function in several versions. You can find the MEX-CUDA code in the MATLAB-CUDA white paper. Nonetheless, in my case native MATLAB, in both double and single precision, is faster than any MEX-CUDA implementation. Here are some files (to run my test script) and a graphical comparison.

test_square_me_cuda.zip (22.6 KB)

Hi,

I have not developed on a Quadro FX 3700M, but I have experience developing on other mobile chips, and my general observation is that the CPU tends to outperform the GPU on laptops. Give your code a try on a desktop system and post your results. It would also be helpful to post your compiler options.

Regards,
Kwyjibo

I also have a desktop HP computer, but when I run the speed_fft.m script (from the NVIDIA plug-in), the CUDA improvement is not so impressive compared to MATLAB. Here is the performance comparison graph.

My desktop has a Pentium® Dual-Core CPU E5200 @ 2.50 GHz, 2.98 GB RAM, and a 32-bit Windows XP operating system. As a GPU, I have a GeForce 9500 GT. I'm using MATLAB release 2009b. All CUDA files were compiled with Visual Studio 2005.

Details:

Driver version: 185.85

CUDA Cores: 32

Core clock: 550 MHz

Shader clock: 1350 MHz

Memory clock: 400 MHz (800 MHz)

Memory interface: 128-bit

Dedicated video memory: 1024 MB

Video BIOS version: 62.94.46.00.7A

IRQ: 16

Bus: PCI Express x16 Gen2

Thank you in advance

Kwyjibo,

I'm afraid you need to know a little more about how CUDA works. Let's look at your code to implement a square function on the GPU:

/* Kernel to square elements of the array on the GPU */
__global__ void square_elements(float* in, float* out, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N) out[idx] = in[idx]*in[idx];
}

For every element of out, you are doing one multiply operation and two (possibly one) memory fetches. The following code may work faster:

/* Kernel to square elements of the array on the GPU */
__global__ void square_elements(float* in, float* out, int N)
{
    const int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N)
    {
        const float x = in[idx];
        out[idx] = x * x;
    }
}

Here we ensure there is only ever one memory fetch. If indeed there were two fetches before, this will double the speed of your kernel, because your kernel is memory bound.

Next, look at how you are calling it. You are using 128 threads per block. On modern hardware this is just far too few. Increase this to 192 or 256 and you'll again see a big speed-up, because more threads are allocated to each CUDA SM.
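As a sketch, the launch configuration change looks like this (in_d, out_d and N are placeholders for whatever your host code already sets up):

```cuda
/* Hypothetical launch of square_elements with 256 threads per block
   instead of 128. Rounding the grid size up covers the case where N
   is not a multiple of the block size; the idx < N guard in the
   kernel handles the leftover threads. */
const int threadsPerBlock = 256;
const int blocksPerGrid   = (N + threadsPerBlock - 1) / threadsPerBlock;

square_elements<<<blocksPerGrid, threadsPerBlock>>>(in_d, out_d, N);
```

Nothing inside the kernel changes; only the <<< >>> execution configuration does.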

Those are the two easy fixes; now for the slightly harder one: use zero-copy memory. Every source element is read once and every destination element is written once, which is ideal for zero-copy memory. Have a look in the NVIDIA programming guide for exactly how to do this. Basically, you use cudaHostAlloc (with the cudaHostAllocMapped flag) instead of malloc on the host side. You then call cudaHostGetDevicePointer to convert the host pointer into a device pointer you can pass to a GPU kernel. Immediately after the kernel call, call cudaThreadSynchronize to wait for the kernel to complete. Finally, instead of the standard C free function, you need to call cudaFreeHost to release the memory allocated with cudaHostAlloc. Also, at some point during initialisation, you need to call cudaSetDeviceFlags(cudaDeviceMapHost) to tell the CUDA runtime to map host memory into GPU memory space. Then remove any cudaMemcpy calls, as you no longer need them.
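Putting those steps together, a minimal host-side sketch might look like the following (error checking omitted for brevity; square_elements is the kernel from earlier in this thread, and N is an arbitrary example size):

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const int N = 1 << 20;

    /* Must be called before any CUDA allocation, so the runtime
       will map host memory into the GPU address space. */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *in_h, *out_h;   /* pinned, mapped host pointers      */
    float *in_d, *out_d;   /* device views of the same memory   */

    cudaHostAlloc((void**)&in_h,  N * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void**)&out_h, N * sizeof(float), cudaHostAllocMapped);

    /* Convert the host pointers into device pointers. */
    cudaHostGetDevicePointer((void**)&in_d,  in_h,  0);
    cudaHostGetDevicePointer((void**)&out_d, out_h, 0);

    for (int i = 0; i < N; ++i) in_h[i] = (float)i;

    /* No cudaMemcpy needed: the kernel reads and writes host
       memory directly over the PCI Express bus. */
    square_elements<<<(N + 255) / 256, 256>>>(in_d, out_d, N);
    cudaThreadSynchronize();   /* wait for the kernel to finish */

    /* out_h now holds the squared values. */

    cudaFreeHost(in_h);
    cudaFreeHost(out_h);
    return 0;
}
```

Note that cudaThreadSynchronize is the era-appropriate call here; later CUDA releases rename it cudaDeviceSynchronize.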

Get this right and you will double the speed of your program. However, do not allocate more memory than you have physically on the host.

Kind Regards,

Shane Cook.

www.cudadeveloper.com