Can the speedup ratio be greater than the number of GPU processor cores?

I read some papers where the speedup ratio is about 1000x, but the GPUs they use contain only about 100-200 processor cores.

Is that possible?

I would think a CPU is more powerful than a GPU,

that is, one CPU core should be more powerful than one GPU processor core.

How can a GPU be 1000x faster than a CPU?

     thanks everyone

Any speedup number can be achieved by slowing down the CPU version of the code.

Another reason for a high speed-up could be the use of special hardware units for certain instructions. A CPU might need several instructions to compute a cosine, while the GPU can do this in hardware.

Unless tons of researchers are lying, yes this can happen.

These sorts of figures come up because the CPU version of their code probably wasn’t optimized. There was actually quite a long discussion a few months ago on this forum. This page from Intel gives the theoretical performance (looks like single precision GFLOPS) for some of their processors. If you compare that to Wikipedia’s page on NVidia GPUs, you’ll see the theoretical speedup can be a bit over 20X for the GTX480. Anything faster than that is because the researcher didn’t optimize his CPU code and probably didn’t use multi-threading.

Actually, these numbers are for double precision, capping honest speedup values at about 10x (if SSE units can be put to good use). In (rare) cases where SSE is of no use, but the warp execution model is, honest speedups of 20x for double or 40x for single precision may be possible.

I personally think the fundamental flaw here is the implicit belief that the “speedup” for a given task is an intrinsic property of the hardware being compared. It isn’t. A speedup is a comparison between two software implementations and the hardware they run on. As a measure of GPU performance, speedup is useful if the CPU implementation you compare to is one which is well-established and/or generally used by your target audience, regardless of its optimization. Then you are working in units that your audience understands.

If the CPU implementation is something you made up just to be the denominator in a ratio, that’s silly and pointless. (and obviously results in nonsense measures, as evidenced by some papers) No one will ever use your CPU implementation, so why does it matter how much faster the GPU is than your CPU implementation? In this case, I’d rather researchers report an absolute unit of performance measure instead of wasting their time trying to write a good CPU implementation. Writing really fast numerical code for a CPU is hard (making this easier for some problems is a nice benefit of CUDA) and I don’t expect that a CUDA expert necessarily will be an expert in CPU optimization as well.

I completely agree with Seibert on this.
What matters to the readers, in my opinion, is: “I, along with most of the people doing work of kind X in field Y, am currently using product A to do the work. Would it work better with the implementation presented in this paper?”

I don't expect anyone to do the double work of writing a CPU version of the product as well as a new GPU version. SSE programming is, in my very limited experience, a nightmare, and much more so than CUDA programming.

I think the focus should be more along these lines:
People are used to product A
If I feed the same input to research application B, and both platforms output the same results, then we can consider that both are able to accomplish the task in a similar fashion, similar enough that you could switch from A to B with no problem. Given that this is the case, how much faster is B?

The matter is not to critique the “old” CPU version but to show the potential users how going from A to B will affect their work.

I see acceleration factors as an overall acceleration of a task and not specifically the GPU over CPU acceleration.

Now, for the second part of Seibert's response: of course, if you have written your own CPU version of what you're trying to accomplish just for the comparison, then it is useless.

So, I have no problem with 1000x or 10000x accelerations if, as I have established, for a given set of inputs and conditions the “new” program is indeed able to do the work 1000x or 10000x faster than the established software. And it has nothing to do with the intrinsic FLOPS capabilities of the machine.

Exactly. 1000x speedup over a very poorly written program that lots of people use is great! Report that result far and wide! It doesn’t matter whether you did it with a GPU or better CPU code.

Agreed. I’ve been writing code with liberal use of SSE intrinsics lately and it’s been quite difficult:

*need to painstakingly massage and permute the data into vector registers with _mm_unpack or _mm_shuffle

*specialization for different data types:

For example, SSE2 has no unsigned 8-bit compare-less-equal, so I had to do a monotonic conversion to signed int8 by adding -128 and using the signed 8-bit compare:

aOffset = a - 128

bOffset = b - 128

return aOffset < bOffset || aOffset == bOffset

Luckily, I discovered later that a <= b is equivalent to min(a, b) == a, which is more efficient than the method above.
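For the curious, both tricks can be written with SSE2 intrinsics. This is a minimal sketch (the function names are my own, not from the post above); it assumes an SSE2-capable x86 target:

```c
#include <emmintrin.h>  /* SSE2 */

/* a <= b per unsigned 8-bit lane, via the monotonic offset trick:
 * flipping the sign bit (equivalent to adding -128) maps unsigned
 * order onto signed order, where PCMPGTB/PCMPEQB are available. */
static __m128i cmple_epu8_offset(__m128i a, __m128i b) {
    const __m128i bias = _mm_set1_epi8((char)0x80);
    __m128i as = _mm_xor_si128(a, bias);
    __m128i bs = _mm_xor_si128(b, bias);
    return _mm_or_si128(_mm_cmplt_epi8(as, bs), _mm_cmpeq_epi8(as, bs));
}

/* The shorter method: a <= b  <=>  min(a, b) == a
 * (PMINUB, i.e. _mm_min_epu8, is an unsigned min). */
static __m128i cmple_epu8_min(__m128i a, __m128i b) {
    return _mm_cmpeq_epi8(_mm_min_epu8(a, b), a);
}
```

Both produce 0xFF in lanes where a <= b and 0x00 elsewhere; the min version needs two instructions instead of five.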

This is appalling. Why hasn't a decent SSE-capable compiler been released? I've tried GCC's -ftree-vectorize and all it can seem to do is SIMDize adding two arrays, a memset, or other trivial operations. It can't do things like compacting elements, which I do painstakingly with _mm_shuffle. I hear the Intel C++ compiler has better SIMDization, but I haven't tried it. I wouldn't be surprised if it's not decent either: Intel's idea of optimization is to have an army of people doing manual optimization (datapaths and IPP libraries) instead of building better tools.

With this deficiency, I can't see a good future for SSE. Currently, GPUs seem to have a 4x raw GFLOPS advantage over CPUs (for the same transistor budget). Since SSE is so clumsy to use, another 4x speedup is likely. This has to be why CUDA doesn't expose operations on vector registers and uses SIMT instead, though performance-wise SIMT is a mirage (it is still better to have all threads run the same instruction).

On the plus side, when I use SSE, I know the code is going to be as optimized as can be for 1 thread. For CUDA, there always seems to be more optimization opportunities (maybe just my inexperience).

Yes, if they were quantum computers, due to exponential state growth with respect to the number of qubits.

I think this is a good question. Aside from that contrived example, are there other examples of greater-than-linear speedup (besides cache effects)?

There's the classic Lanchester square law, which I believe says that for armies with ranged weapons, the enemy's attrition rate is proportional to (number of soldiers)^2.

It would be great if there were an analogy in parallel computing.

I had similar issues with -ftree-vectorize. I rewrote the core of one of our programs to share 80% of the code between the CPU and GPU implementation, and in the process eliminate all the C++ indirection that I thought was confusing the GCC vectorizer. Turns out that function calls seem to be pretty good at killing auto-vectorization (perhaps the inliner was not as aggressive as nvcc). Eventually I gave up because putting a CUDA device near the computer was easier than flogging GCC or dealing with very hard to read SSE code.

You said the speedup is about 20x at most. But what if my program is full of cosine operations and uses __cosf()? Can you or anyone else estimate the speedup ratio? I am curious about that.