You often see people claiming something like a 300x speedup on some ports, but… that's not always entirely honest. The CPU implementation is usually single-threaded and fairly unoptimized, because the implementor's focus was on the GPU rather than on producing an optimized CPU version.
If you want straight numbers to compare, you can always use the peak GFLOP/s figures, but it's important to note that the peak GFLOP/s for GPUs usually assume a MAD (multiply-and-add) instruction every cycle. Realistically, it might be half that if you only have a multiply or an add to do per cycle, or even worse once you account for transfer latencies and other overheads. On the other hand, thanks to caching and branch prediction, the CPU is more likely to actually run near its rated GFLOP/s, at least on a single core. I'm not sure whether SSE is factored into the GFLOP/s rating for CPUs, but it's something to check if you want a truly comparable benchmark.
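To make that concrete, here is a minimal sketch of where those peak numbers come from (roughly cores × clock × FLOPs per cycle). The core counts and clocks below are made-up placeholders for illustration, not the specs of any particular card or CPU:

```c
/* Sketch of how "peak GFLOP/s" figures are usually derived.
 * All device specs below are illustrative placeholders. */
#include <stdio.h>

int main(void)
{
    /* GPU: assume every core issues one MAD (2 FLOPs) per cycle. */
    double gpu_cores     = 448;    /* hypothetical core count     */
    double gpu_clock_ghz = 1.15;   /* hypothetical shader clock   */
    double gpu_peak = gpu_cores * gpu_clock_ghz * 2.0;

    /* CPU: peak usually assumes full-width SSE (4 single-precision
     * lanes) with an add and a multiply retired every cycle. */
    double cpu_cores     = 4;
    double cpu_clock_ghz = 3.0;
    double cpu_peak = cpu_cores * cpu_clock_ghz * 4.0 * 2.0;

    printf("GPU peak: %6.1f GFLOP/s (if every cycle is a MAD)\n", gpu_peak);
    printf("CPU peak: %6.1f GFLOP/s (if SSE is fully used)\n", cpu_peak);
    printf("Drop the MAD/SSE assumptions and both numbers shrink fast.\n");
    return 0;
}
```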
In short, it's really hard to compare one to the other, even for embarrassingly parallel algorithms. The best indicator is to try it and see, as in the sketch below.
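A minimal "try and see" comparison might look like the following. SAXPY is just a placeholder workload, the GPU timing deliberately includes the PCIe transfers, and the single-threaded CPU loop is exactly the kind of naive baseline criticized above, so take whatever ratio comes out with the same grain of salt:

```cuda
/* Time the same SAXPY on the CPU and on the GPU, counting the
 * host<->device copies against the GPU. Illustrative only. */
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <cuda_runtime.h>

__global__ void saxpy_gpu(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

static void saxpy_cpu(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 24;                      /* 16M elements */
    size_t bytes = n * sizeof(float);
    float *x = (float*)malloc(bytes), *y = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    /* CPU timing (single-threaded, so it understates a good CPU port). */
    clock_t t0 = clock();
    saxpy_cpu(n, 2.0f, x, y);
    double cpu_ms = 1000.0 * (clock() - t0) / CLOCKS_PER_SEC;

    /* GPU timing, including the transfers both ways. */
    float *dx, *dy;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dx, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, bytes, cudaMemcpyHostToDevice);
    saxpy_gpu<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);
    cudaMemcpy(y, dy, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);

    printf("CPU: %.1f ms   GPU (incl. transfers): %.1f ms\n", cpu_ms, gpu_ms);

    cudaFree(dx); cudaFree(dy); free(x); free(y);
    return 0;
}
```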
I am one of those guilty of focusing more on the GPU side of things than on properly optimizing my 'gold' comparison methods :P So SSE is something I was unsure about. Forgive my ignorance about CPU specs.
One of the latest issues of the German c't magazine featured a pretty critical review and benchmark of the Tesla C2050's double-precision performance, comparing it against a dual-Xeon setup.
The Tesla outperformed the Xeons substantially only for very carefully chosen problem (matrix) sizes; see the sketch below.
For single precision, they found you're better off with a GTX 480 ;)
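If you want to see that size sensitivity yourself, a minimal sketch like the following sweeps DGEMM sizes on the GPU and prints achieved GFLOP/s. It assumes cuBLAS is available and only measures the GPU side; for a fair comparison you would run the same sizes through an optimized, multi-threaded CPU BLAS (e.g. MKL), not a naive triple loop:

```cuda
/* Sweep DGEMM sizes on the GPU and report achieved GFLOP/s.
 * Compare against an optimized multi-threaded CPU BLAS over the
 * same sizes to get the other column of the table. */
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const int sizes[] = {256, 512, 1000, 1024, 2048, 4000, 4096};
    const double alpha = 1.0, beta = 0.0;

    for (int s = 0; s < (int)(sizeof(sizes) / sizeof(sizes[0])); ++s) {
        int n = sizes[s];
        size_t bytes = (size_t)n * n * sizeof(double);

        double *dA, *dB, *dC;
        cudaMalloc(&dA, bytes);
        cudaMalloc(&dB, bytes);
        cudaMalloc(&dC, bytes);
        cudaMemset(dA, 0, bytes);   /* contents don't matter for timing */
        cudaMemset(dB, 0, bytes);

        /* Warm-up call so the timed run isn't paying init costs. */
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);
        cudaDeviceSynchronize();

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double gflops = 2.0 * n * n * (double)n / (ms * 1e6);
        printf("n = %5d : %7.2f ms  %8.1f GFLOP/s\n", n, ms, gflops);

        cudaEventDestroy(start); cudaEventDestroy(stop);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }

    cublasDestroy(handle);
    return 0;
}
```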
The article is in German and not freely available. I don't think the results come as a huge surprise to anyone familiar with the matter. But maybe some more public debunking of the 100X GPU vs. CPU myth is still needed.