GPU (GeForce 8400) is three times _slower_ than the CPU when adding vectors. What am I doing wrong?

Hi!

I am experiencing the problem described in the subject line on my machine.

Here is the code of the program: http://pastie.org/774734

I comment out the ‘Calculate on GPU’ section and uncomment the ‘Calculate on CPU’ one to switch to the CPU.
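The kernel itself is just an element-wise add, roughly along the lines of this simplified sketch (the kernel and argument names are placeholders; the real code is in the pastie above):

__kernel void vec_add(__global const float* a,
                      __global const float* b,
                      __global float* c)
{
    size_t i = get_global_id(0);  /* one work-item per element */
    c[i] = a[i] + b[i];
}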

And the results are surprising:

Using GPU:

$ time ./sandbox-opencl

real	0m2.431s

user	0m0.730s

sys	 0m1.600s

Using CPU:

$ time ./sandbox-opencl

real	0m0.755s

user	0m0.490s

sys	 0m0.260s

My video card: NVIDIA GeForce 8400

My CPU: AMD Athlon™ 64 X2 Dual Core Processor 3800+

What am I doing wrong? How can thousands of GPU cores be slower than one CPU core?

Are you sure that your compiler doesn’t optimize your CPU calculation away? You do nothing with the result; even a cheap printf at the end could help prevent this.

Second, be sure not to comment out any of your memory transfers to/from the GPU when running the CPU mode for comparison, as that is 128 MB of data, which could also be a performance hit.
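Something along these lines at the end of the CPU path is usually enough to keep the result alive (just a sketch; result and n are placeholders for your output array and its length, and printf needs <stdio.h>):

/* Consume the result so the compiler cannot discard the computation. */
float checksum = 0.0f;
for (size_t i = 0; i < n; ++i)
    checksum += result[i];
printf("checksum = %f\n", checksum);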

NilsS, thanks for the fast reply.

I edited the code according to your suggestions.

New version: http://pastie.org/774750

Diff: http://pastie.org/774752

The benchmark results changed slightly, but not significantly:

GPU:

real	0m2.434s

user	0m0.820s

sys	 0m1.530s

CPU:

real	0m0.975s

user	0m0.700s

sys	 0m0.260s

Depending on your compiler, it could even be intelligent enough to precompute the whole result.

Additionally, I would try:

int offset = rand();

Then initialize your vectors with i + offset;

This way you can be sure that your compiler doesn’t precompute anything and that the whole code actually gets executed.
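For the initialization, something like this (a sketch; a, b and n are placeholders for your input vectors and their length, rand/srand come from <stdlib.h> and time from <time.h>):

srand((unsigned)time(NULL));   /* seed so the offset differs between runs */
int offset = rand();           /* value unknown at compile time */
for (size_t i = 0; i < n; ++i) {
    a[i] = (float)(i + offset);
    b[i] = (float)(i + offset);
}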

Thanks for the suggestion, but it did not affect the results at all :-(

Could the grid dimensions (1x1x8454144) be a problem?

Out of curiosity, does your OpenCL implementation allow that many work-items in the first dimension (8454144)? Shouldn’t this conform to the values obtained through CL_DEVICE_MAX_WORK_ITEM_SIZES? I get a maximum of 512 x 512 x 64 from CL_DEVICE_MAX_WORK_ITEM_SIZES.

No, that limit applies to the work-items within a single work-group, not to the whole NDRange (grid). I believe there’s no way to query the maximum NDRange size.
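Both work-group limits can be queried with clGetDeviceInfo; a quick sketch (device stands for whatever cl_device_id you already have, headers are <CL/cl.h> and <stdio.h>):

size_t max_sizes[3];     /* per-dimension work-group limits; assumes 3 dimensions */
size_t max_group_size;   /* total work-items allowed in one work-group */
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                sizeof(max_sizes), max_sizes, NULL);
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(max_group_size), &max_group_size, NULL);
printf("max work-item sizes: %zu x %zu x %zu, max work-group size: %zu\n",
       max_sizes[0], max_sizes[1], max_sizes[2], max_group_size);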

Daniel, FYI there’s an error in your code that will manifest with newer driver versions:

cl_context context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, 0, 0, 0);

This should not work; you will be required to supply a valid platform ID (via the context properties) instead of 0.
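Something like this instead (a sketch with minimal error checking; it just grabs the first platform):

cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);

cl_context_properties props[] = {
    CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0
};
cl_int err;
cl_context context = clCreateContextFromType(props, CL_DEVICE_TYPE_GPU,
                                             NULL, NULL, &err);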

Now, as for the source of your problems:

You have roughly 8 million floats to add: two input vectors, one output, 16M reads and 8M writes. The amount of computation is negligible; we’re memory bound. That’s 64 MB of reads and 32 MB of writes.

Your CPU’s realistic read/write RAM bandwidth is around 5 GB/s, so the best time you can get is simply 64 MB/(5 GB/s) + 32 MB/(5 GB/s) ~= 0.01875 s. You’re probably not getting anywhere near this because your loop is not unrolled and you’re not using SSE vector loads and stores. It’s a theoretical peak anyway; never mind that.

With the GPU, you need to copy the data to the device (2 GB/s if you’re lucky on your platform and not using pinned memory), compute (6.4 GB/s peak bandwidth on an 8400, 5 GB/s realistic) and copy the result back (2 GB/s). You can probably see the problem right there, but let’s do the calculations anyway:

64 MB/(2 GB/s) + 96 MB/(5 GB/s) + 32 MB/(2 GB/s) ~= 0.065625 s

The analytical difference in performance is 0.065625 s / 0.01875 s = 3.5x, i.e. your CPU should be about that much faster than your GPU on this task.
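If you want to double-check the arithmetic, here is a tiny standalone C sketch that just reproduces the two estimates above and their ratio:

#include <stdio.h>

int main(void)
{
    const double MB = 1.0 / 1024.0;                     /* one MB expressed in GB */
    double cpu = 64*MB/5.0 + 32*MB/5.0;                 /* reads + writes at ~5 GB/s */
    double gpu = 64*MB/2.0 + 96*MB/5.0 + 32*MB/2.0;     /* host->device + on-device + device->host */
    printf("CPU ~ %.6f s, GPU ~ %.6f s, ratio %.2fx\n", cpu, gpu, gpu / cpu);
    return 0;
}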

Vector addition is simply not a good problem for a GPU. You’re exploiting neither its superior memory bandwidth (well, not really superior on the 8400, but generally), since most of the time is spent on device<->host transfers, nor its compute throughput.

Big_Mac, thank you, that helps a lot! Everything is clear now.