300x to 600x faster... really?

I believe their 47000x speedup is actually meant to show that their algorithm scales with the number of threads, which is an important factor.
Your program may not be very well optimised, but if you get a 2x speedup every time you double the number of active threads, then it is an algorithm worth checking out.

Personally, however, I would compare the number of Streaming Multiprocessors being used against the speedup gained by using more of them. Making your algorithm scale well with the number of SMs is an act of creativity; getting an extra 50% by shaving a few instructions off your implementation is just engineering. In the future, if the hardware changes (while remaining within its parallel paradigm), it is the algorithm that will matter, not the implementation.
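As a rough sketch of how one might measure that kind of scaling in practice (the kernel, array size and block size below are all made up for illustration), you can keep the total work fixed, vary the number of blocks, and time each launch with CUDA events:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical memory-bound kernel: every block strides over the whole array,
// so the total work is the same no matter how many blocks are launched.
__global__ void scale_kernel(float *data, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 24;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Double the block count up to the number of SMs and watch how the time drops.
    for (int blocks = 1; blocks <= prop.multiProcessorCount; blocks *= 2) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        scale_kernel<<<blocks, 256>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%3d block(s): %.3f ms\n", blocks, ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }

    cudaFree(d_data);
    return 0;
}

With roughly one block per SM, the resulting curve tells you how well the algorithm itself scales with the number of SMs, independent of low-level instruction tuning.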

I am curious what people here would think about a hypothetical CUDA compiler that could target CPUs as well as GPUs with similar efficiency, as a solution to this problem. I personally believe that even this would not be a fair comparison, because an optimized implementation would use different algorithms at the source level for each architecture, but it would at least remove one dimension (different source-level implementations for each architecture).

Who is talking about CUDA targeting CPUs?

Hey,

I should say I am in the same boat. I have ported a molecular dynamics application and am seeing around a 250x speedup on a GTX 280. I am not convinced by this number, because I cannot explain where it comes from.
For comparison, yes, I use a single-core CPU implementation, and it is optimised to my knowledge, although it does not use SSE. I did not write the CPU version.

I really want to characterise this speedup number. I found out that my application is memory bound and uses around 250 MB of memory, so I was wondering whether the difference in memory bandwidth between the CPU and the GPU can explain some of the speedup. But I am not entirely sure of this and need inputs.

I also want to know whether the global memory bandwidth reported by cudaProf is the actual bandwidth my application is achieving, and how to find the bandwidth on the CPU that would be a fair comparison to it. And how can I measure shared memory bandwidth on the GPU?

Is there any existing model for performance characterisation on GPUs?

Memory bandwidth is huge, absolutely. I’m getting what appears on the surface to be a super-linear speedup in a CFD/linear algebra type code, but any realistic-sized problem is heavily limited by memory. I personally just plot relative speedup vs data size to quantify it.

If you’re truly memory bandwidth bound you should only be able to get 2-3x speedup for a Tesla C1060 or GTX 280/285 compared to a Nehalem. See for example the STREAM memory benchmarks at:

http://www.advancedclustering.com/company-blog/stream-benchmarking.html
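On the GPU side, one simple way to get an "achieved bandwidth" number you can put next to the STREAM figures is to time a copy kernel yourself and divide the bytes moved by the elapsed time. A minimal sketch (kernel name and sizes are made up); the same arithmetic applied to a timed loop or memcpy on the CPU gives the comparable number there:

#include <cstdio>
#include <cuda_runtime.h>

// STREAM-style copy: each element is read once and written once.
// Grid-stride loop keeps the launch within the 65535-block grid limit.
__global__ void copy_kernel(const float *in, float *out, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = in[i];
}

int main()
{
    const int n = 1 << 25;                    // 128 MB per array
    const size_t bytes = n * sizeof(float);

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copy_kernel<<<1024, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // One read plus one write per element, reported in GB/s.
    double gbps = 2.0 * bytes / (ms * 1.0e6);
    printf("Effective bandwidth: %.1f GB/s\n", gbps);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}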

I mean as a basis for comparing two equivalent implementations of the same program, one on a GPU and one on a CPU. Sorry if that wasn’t clear.

Thanks for the reply mlohry. I want to know how I can empirically show the difference in the bandwidth that my application achieves on the CPU and on the GPU.

Maybe I’m understanding this incorrectly… if you’re compute bound, you’re reaching a reasonable fraction of the theoretical GFLOPS; getting two orders of magnitude of speedup would then mean you were reaching less than 1% of peak on the CPU and near 100% on the GPU?

does anyone have any legitimate examples of this? (or even a reasonable hypothetical scenario)

Let’s say you do something like taking the sine and cosine of a linearly interpolated texture lookup:

float s, c;
__sincosf(tex2D(tex, x, y), &s, &c);

And you know you don’t need any more accuracy than the hardware units provide. The speedup of this on a GPU over even a well-optimized CPU code would be quite substantial. This is probably a pretty made-up usage scenario, though.
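For concreteness, a hypothetical kernel built around that pattern might look as follows (the texture would have to be bound on the host with cudaFilterModeLinear for the interpolation to happen in hardware; all names are invented):

texture<float, 2, cudaReadModeElementType> tex;

// Hypothetical kernel: a hardware-interpolated texture fetch followed by the
// fast sin/cos intrinsic, neither of which has a single-instruction CPU equivalent.
__global__ void sincos_tex_kernel(float *s_out, float *c_out, int width, int height)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix >= width || iy >= height)
        return;

    // Bilinear interpolation is done by the texture unit.
    float v = tex2D(tex, ix + 0.5f, iy + 0.5f);

    float s, c;
    __sincosf(v, &s, &c);            // evaluated on the special function units

    s_out[iy * width + ix] = s;
    c_out[iy * width + ix] = c;
}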

Wow! That qualifies for quote of the day… Finally good sense has started to prevail in the forum… (me included - thanks to vvolkov).

It implies that if a single-threaded CPU implementation takes 1000 s to complete, the GPU takes 10 s. And nothing more than that. And when I say compute bound, I mean that the optimal GPU implementation is compute bound, and nothing more than that.

I’d argue that a single-threaded, non-SSE implementation for the CPU is NOT well optimized, regardless of how smart the code is or how many tricky assembly tricks are used.

Yes, this has been described in a research paper and was rumored to be slated for a release of nvcc, but it never happened. The paper described a series of source transformations that turn your CUDA kernel (and associated host calls) into something that OpenMP and auto-vectorizing compilers can optimize. Sure, it won’t be the fastest you could possibly achieve on the CPU, but it would probably be better than what a lot of people could do by hand. (I imagine there are a few algorithms where the CPU performance would be dismal under a direct source transformation.)
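Roughly, the idea is the kind of transformation sketched below (a made-up example, not the paper’s actual output): the block index becomes an outer, OpenMP-parallel loop and the thread index becomes an inner loop that an auto-vectorizer can target. Kernels containing __syncthreads() need their loops split around each barrier, which is the harder part.

// Original CUDA kernel: one thread per element.
__global__ void saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Roughly what a CUDA-to-CPU translator could emit: blocks become the outer
// (OpenMP-parallel) loop, threads within a block become an inner loop.
void saxpy_cpu(float a, const float *x, float *y, int n,
               int gridDim_x, int blockDim_x)
{
    #pragma omp parallel for
    for (int blockIdx_x = 0; blockIdx_x < gridDim_x; ++blockIdx_x)
        for (int threadIdx_x = 0; threadIdx_x < blockDim_x; ++threadIdx_x) {
            int i = blockIdx_x * blockDim_x + threadIdx_x;
            if (i < n)
                y[i] = a * x[i] + y[i];
        }
}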

Honestly, I like CUDA a lot for expressing a data-parallel computation, and would really appreciate a compiler that would turn such code into something that actually makes use of the SSE units and the many cores every one of my workstations has. In five years I expect OpenCL will fill this niche, but right now it sounds like its performance is not up to CUDA standards yet.

Something like that (I’ve heard the rumours too…) would be immensely useful. Since just about every machine purchased today is multicore, having such a back end would be a big ‘win’ for CUDA (or OpenCL). Although the speedup might not be optimal, being able to use multiple cores would give some speedup (remember, most scientific codes are written by scientists, and hence are probably not optimal even in serial).

Related to the last point and benchmarking… what do you do if the person doing the CUDA port didn’t write the serial version? This is what has happened to me. Now, I did fix up the slight oversight of using BLAS to multiply around a million 2x2 matrices together, but I haven’t taken the effort to purge the code of array[i][j][k] statements (let alone putting in SSE, with which I have no experience). Does that make my benchmarks unfair? Remember, I’m paid to write the CUDA code, not fix the CPU code.
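(For what it’s worth, the 2x2 fix is trivial to do by hand. Assuming "multiplying together" means a running chain product and that each matrix is stored row-major as four contiguous floats, something like the sketch below already removes the per-call BLAS overhead; the names and layout are only illustrative.)

// 2x2 matrices stored row-major as 4 contiguous floats: a[0] a[1] / a[2] a[3].
static inline void mm2x2(const float *a, const float *b, float *c)
{
    c[0] = a[0] * b[0] + a[1] * b[2];
    c[1] = a[0] * b[1] + a[1] * b[3];
    c[2] = a[2] * b[0] + a[3] * b[2];
    c[3] = a[2] * b[1] + a[3] * b[3];
}

// Running product of a long chain of 2x2 matrices, with no library call per step.
void chain_product(const float *mats, int count, float *result)
{
    float acc[4] = { 1.0f, 0.0f, 0.0f, 1.0f };   // start from the identity
    float tmp[4];
    for (int k = 0; k < count; ++k) {
        mm2x2(acc, mats + 4 * k, tmp);
        for (int j = 0; j < 4; ++j)
            acc[j] = tmp[j];
    }
    for (int j = 0; j < 4; ++j)
        result[j] = acc[j];
}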

The original paper was done by John Stratton at UIUC. Even though NVIDIA has not released a version of nvcc based on this, it seems like the research project, at least, is still going on. There is a paper that was just announced at CGO that I think is based on the same work:

John Stratton, Vinod Grover, Jaydeep Marathe, Bastiaan Aarts, Mike Murphy, Ziang Hu and Wen-mei Hwu. Efficient Compilation of Fine-grained SPMD-threaded Programs for Multicore CPUs

Even though the text of the paper hasn’t been released yet, it is by the same people, and the title suggests that the goal of compiling CUDA to multicore CPUs is the same.

I agree. I am personally not very fond of the current high level languages and threading libraries for multi-core applications. MPI is my second favorite behind CUDA and even MPI programs sometimes require a bit of tuning to get them to scale to a larger machine. I think that it would be very useful to be able to use something like CUDA for generic application development.

If such a tool were ever to be released, how efficient would it have to be for people to use it for both the GPU and CPU implementations of the same program? For example, if the CPU version were on average 5x/50x/500x slower than the GPU version, would you end up using this rather than writing two different versions?

The relevant metric isn’t comparison to the GPU version, but to the original serial CPU code. If such a compiler lets me write CUDA code for a CPU and get (say) a 3x speedup on 4 cores compared to the original serial code, that’s a big win if the user doesn’t have a hefty GPU in their machine. Perhaps an optimised parallel CPU code would get perfect scaling, but if that doesn’t currently exist… Furthermore, with two separate implementations you would still have all the maintenance fun of two code bases. For machines like ORNL’s upcoming GPU cluster, it doesn’t make sense. But for a code to be distributed to lots of users, most of whom are not terrifically competent with computers (their expertise is elsewhere) and who look blankly at you when you say ‘GPU,’ it’s a significant win.

I agree with YDD on this. I have to write code that runs on a wide variety of platforms for people working on my experiment. When it is important, I can make sure the code runs on a system equipped with GTX 295s, but for testing and small jobs, it needs to also run on the CPU. If a CUDA-to-CPU compiler could even hit half the speed of a well-tuned multicore/SSE implementation, I would be ecstatic. That would easily be an improvement over any CPU code I would write, and would actually encourage me to port more parts of our application to CUDA. Then, when a GPU is present, I could decide which kernels would benefit from running on the GPU, given the latency and bandwidth limits of PCI-Express, and let the rest run on the CPU. (This is why I would love to see a hybrid computer where the CUDA device communicates with the CPU over Hypertransport. The threshold for GPU-worthy kernels would drop way down.)

Thanks seibert and YDD for the comments.

Good to know. I’ll post a more detailed response on dec 15th.

How about CUDA-to-Larrabee compiler? That would be not only very useful but also hilarious knowing how much NVIDIA and Intel like each other :)