GPU vs CPU performance comparison

I am trying to compare the performance of a single GPU core with a single CPU core, using a simple code:

float x = 0.0;
for (int i = 0; i < 100000; i++) {
    x = sin(x + 0.1*i);
}

I am getting the result that Intel Xeon CPU can do this job 10x faster than NVIDIA GeForce GTX 295. Does it make sense? Has anybody tried to make a similar comparison?

Hmm, I don’t see how you would trivially parallelize the problem, as the value of each iteration depends on the previous one. How did you implement it on the GPU? Or did you simply launch the same kernel with a (1,1,1) grid and a (1,1,1) block 100,000 times with an incremented argument?

Thank you for your reply!

Yes, you are right, I simply launched one kernel with a (1,1,1) grid and block, as you described. As I said, I wanted to compare one single GPU core with one CPU core. I could not find any comparison like this on the forum, so I am wondering if anybody else has done similar tests.

Actually, it would be interesting to compare the performance of a single core across different graphics cards. I don’t know how much they differ. Does performance scale simply with clock frequency, or is there something else to consider? For instance, I would like to know how different the cores are in the GeForce GTX 295 and the Tesla C1060. The price difference between the two cards is significant, but how does the price scale with per-core performance?

Thank you again.

Even that won’t compare a single “core” vs. “core”, as you would effectively be using only an eighth of the resources, if not 1/32nd.

It makes absolutely no sense to compare CPU to GPU on sequential, single-core algorithms, as this is the exact opposite of what the GPU is supposed to be doing. It’s not designed for this.

It’s as if you’d launched ten million pthreads on a single-core CPU to see how it compares with ten million threads launched on CUDA.

I am aware of that. Well, the reason for my interest is simple. Some problems cannot be vectorized, so they are not appropriate for the GPU. Therefore, I would like to get some feeling for which parts of a program to execute on the CPU and which are better suited for the GPU. With dual quad-core Intel Xeon CPUs I have 8 extremely powerful cores in total. The GPU offers 240 (less powerful?) cores. The best way to find a solution is to try and experiment, but it is also not bad to have some general feeling for where “I gain” and where “I lose”.

There are two basic performance dimensions for CUDA: compute speed and memory bandwidth. Compute speed is proportional to [# of stream processors] * [shader clock rate]. Memory bandwidth is proportional to [width of bus] * [memory clock]. (Assuming you are comparing cards with the same memory technology, like GDDR3.)

Most algorithms can be classified as either compute bound or memory bound, and memory bound algorithms are far more common than you might expect. Supplying 240 stream processors with data can quickly eat up all the memory bandwidth available on the device, even with an enormously large 512-bit wide bus.

The price difference between Tesla and GeForce is entirely a function of target market and quality control. Performance-wise, the C1060 is very similar to a GTX 285, although the C1060 has 4 GB of device memory instead of 1 GB. Additionally, the Tesla is tested for 24/7 operation in a cluster, whereas the GeForce cards are not.

That’s not to say the GeForce cards don’t work for serious computation. Many of us do long calculations on GeForce cards, but we take the risk of early card death. I’ve worked with 9 GeForce GPUs over the last 2 years, and only one failed (a GTX 280). That’s a completely unscientific number since each card was exposed to different workloads and environments (the failed 280 definitely did not get the same cooling as some of the other cards). I would love to know the failure rates of Teslas installed in compute clusters, but so far have not seen anyone publish that.

At this point, some people are puzzled: “$1000 extra for that?” If you fall into that category, the GeForce line is for you. :)

I think the warning Big_Mac is trying to convey is that when running 1 block with 1 thread, the GPU is in a highly non-linear regime. You can’t scale directly from a 1-thread benchmark to the expected running time of N threads. The behavior of a GPU does not scale the same way as a multicore CPU.

And yes, each stream processor is less powerful than a Xeon core, in part due to the lower shader clock (~1.2 GHz in the GTX 295 I think).

A more direct comparison between the CPU and the GPU is to think of GTX 295 as two separate 30 “core” processors clocked at 1.2 GHz. Each core executes a 32-wide SIMD instruction in 4 clocks. On the CPU side, you have two separate 4 core CPUs clocked at 2-3 GHz. Each CPU core can execute a 4-wide SIMD (SSE) instruction per clock (true?). At a hardware level, this is much more of an apples-to-apples comparison.

The CUDA software model allows you to program a SIMD device without writing explicit SIMD instructions (instead you create thousands of threads). This tends to lead people to assume that a stream processor in CUDA is the functional equivalent of a CPU core, which is not the case.

Thank you, guys, for your replies and for useful information. I think you answered my questions.

As a next step, I submitted several jobs for execution on the GPU and on one CPU core of an Intel Xeon. The GPU runs the jobs in parallel, whereas the CPU runs them in series. The code is very similar to the one I posted above:

float x = 0.1;
for (int i = 0; i < 10; i++) {
    for (int j = 0; j < 18000; j++) {
        x = sin(x + 0.1*i);
    }
}

The initial conditions (x) are different for different jobs.

The attached picture shows the execution time vs. the number of jobs (i.e., the number of threads on the GPU, or the number of sequential computations on the CPU). Thus, even a “budget” GPU outpaces one Xeon CPU core at 15 jobs in this specific case. It was interesting to see how flat the line for the GPU is at the beginning, where the resources are “underloaded”.

What I am doing here probably seems like nonsense to you guys, but I think that I now have a better idea of where GPUs can be efficient.

The hardware I used in this test:

CPU: Intel Xeon E5405@2.00GHz

GPU: Quadro NVS 290
time.png

Your code is not parallelizable. If you run the iterations of the loop concurrently, you’re getting different results and you have a race condition.

I am sorry. Probably I was not clear, or I misunderstood your comment. The code I wrote above is, basically, my kernel, which I submit for execution. Every kernel executes the same number of iterations (10 × 18000). The only difference between kernels is the initial value of x (which I calculate based on thread_id, block_id, etc.).