I’d like to get your general opinion on a contradiction I keep running into.
On the one hand, CUDA is declared to be a high-level C language that supports loops, structures, and branches, so the documentation makes it look really easy to write software for the GPU.
However, if you look at the reduction example doc, especially at the last optimization step that really delivers code faster than optimized CPU code, you will realise that:
you shouldn’t really use loops
you shouldn’t really use branching
you shouldn’t really use complex data structures, because memory access will then be non-coalesced
you should always operate on data aligned to at least 16 bytes, or even powers of 2, so processing general arrays like 17x35 is a nightmare
optimized CUDA code (like step 7 of the “reduction” sample) looks more like a specialized hardware configuration script (closer to FPGA programming) than like real C; see the sketch below.
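For reference, here is roughly the shape the final, fully unrolled kernel takes. This is a minimal sketch in the spirit of the SDK reduction sample, not the sample itself; it assumes a fixed block size of 256 and relies on warp-synchronous execution of the last 32 threads, as the old sample did.

```
__global__ void reduceUnrolled(const float *g_in, float *g_out, unsigned int n)
{
    extern __shared__ float sdata[];            // blockDim.x floats, assumed 256
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * (blockDim.x * 2) + tid;

    // Each thread loads and sums two elements -- already a hand-tuned trick.
    float x = (i < n) ? g_in[i] : 0.0f;
    if (i + blockDim.x < n) x += g_in[i + blockDim.x];
    sdata[tid] = x;
    __syncthreads();

    // Tree reduction in shared memory, unrolled by hand.
    if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads();
    if (tid <  64) { sdata[tid] += sdata[tid +  64]; } __syncthreads();

    if (tid < 32) {            // last warp: no __syncthreads(), warp-synchronous
        volatile float *s = sdata;
        s[tid] += s[tid + 32];
        s[tid] += s[tid + 16];
        s[tid] += s[tid +  8];
        s[tid] += s[tid +  4];
        s[tid] += s[tid +  2];
        s[tid] += s[tid +  1];
    }
    if (tid == 0) g_out[blockIdx.x] = sdata[0];  // one partial sum per block
}
```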
The questions are:
why was there a need to support all these loop and branching constructs in the language if using them is prohibitive for performance?
so far the GPU wins in terms of hardware cost relative to performance. How about the cost in man-hours to write really fast CUDA code that is “hard to get right” (I cite the reduction doc), relative to the similar cost of getting fast code using the Intel MKL library, for instance, or the optimizing Intel C++ compiler on the CPU? Can you share your experience here?
No problems with loops or branching. With complex structures and memory access we see the limits of the GPU, BUT:
The GPU does not replace the CPU: for some kinds of algorithms (for example Monte Carlo) you get a performance factor roughly equal to the number of processors per card, so 128 for an 8800 GTX. In fact I get more like 180 than 128… So with a 9800 GX2 you can get roughly a factor of 600. We are speaking of two different worlds, and with the GPU you get access to a world you cannot imagine with the CPU (tell me the factor of your optimized code in C#). GPUs have extraordinary qualities, but they are only useful for some algorithms. If the GPU gives you just a factor of 2 or 5, I agree with you that it is not necessarily a good choice, considering the constraints and the cost to program. BUT when you have a factor of 350 (I have 2 GTXs)… it is fabulous… :D :D :D
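To illustrate why Monte Carlo maps so well: every path (or batch of samples) is independent, so one thread per batch with its own private RNG state is enough. A hypothetical sketch, not anything from this thread; the pi-estimation task and the LCG constants are my own choices:

```
// One independent Monte Carlo batch per thread; no inter-thread communication.
__global__ void mcPi(unsigned int *hits, unsigned int samplesPerThread,
                     unsigned int seed)
{
    unsigned int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int state  = seed ^ (tid * 2654435761u);  // per-thread RNG state
    unsigned int inside = 0;

    for (unsigned int s = 0; s < samplesPerThread; ++s) {
        state = 1664525u * state + 1013904223u;        // LCG step
        float x = (state & 0xFFFFFF) / 16777216.0f;
        state = 1664525u * state + 1013904223u;
        float y = (state & 0xFFFFFF) / 16777216.0f;
        if (x * x + y * y <= 1.0f) ++inside;
    }
    hits[tid] = inside;    // host sums the per-thread counts afterwards
}
```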
Because sometimes, branching a bit in a CUDA kernel lets you do straight-up high-speed processing in the rest of the kernel. In previous GPGPU paradigms, you’d have to pull the data back to the host, process it a bit, and send it back to finish. The whole bus-latency-and-bottleneck thing is much slower than a couple of not-so-fast branch statements.
Basically, the slow branching stuff doesn’t add speed by itself, but it adds enough flexibility to make the whole processing chain go fast.
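For example (a toy sketch of my own, not anyone’s real kernel): a 1D smoothing pass where the edge elements take a slower in-kernel branch instead of being shipped back to the host for special-case handling.

```
__global__ void blur1d(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;                       // branch 1: grid overshoot

    if (i == 0 || i == n - 1) {               // branch 2: edges handled in-kernel
        out[i] = in[i];
    } else {                                  // fast path for the interior
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
    }
}
```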
Indeed, I use loops and branching, make uncoalesced reads, and have a complex struct in my kernel. I have broken pretty much every performance “rule” there is. However, I do all of these things in a way that limits their performance penalty in my kernel, so the GPU is still enormously faster than the CPU version of my code.
I think about it this way: The point of using CUDA is not to 100% utilize every chip resource with maximum efficiency. The point of CUDA is to solve your computing problem as fast as is practical. If doing inefficient things occasionally allows your algorithm to do more work on the GPU, then it can still result in a net win.
(I should mention: loops and branching are not inherently a problem. The thread grid can make looping unnecessary, and frequent branching that splits warps leaves stream processors underused. Still, both constructs are very useful in general in CUDA; see the sketch below.)
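A minimal sketch of what “the thread grid can make looping unnecessary” means in practice (a saxpy-style example of my own, not from this thread): the CPU loop over elements disappears and each thread handles one index.

```
// The loop index becomes the thread index; one benign branch guards the tail.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Host-side launch replacing "for (i = 0; i < n; ++i) y[i] = a*x[i] + y[i];":
//   saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);
```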
Comparing FPGA development in VHDL or Verilog to CUDA seems fairly absurd. There is really no reason that loops should automatically sacrifice performance. In fact, loops can lead to dramatic performance increases for some algorithms.
Indeed, CUDA is meant for HPC at the cost of a regular PC. It lets you try things that can’t be reached otherwise. :) It is the same as with restaurants: “fast, tasty, cheap - pick any two”.
On the serious side, the whole micro-parallelism approach is a very new branch (for wide use, I mean; in academia it has been around for ages). There are simply not enough tools to make it right. Compilers are not smart enough to fix programmers’ errors and offer an outer layer of abstraction.
The same thing applies to conventional CPUs. For many scientific tasks you can easily get 10x if you do it in assembler rather than in C. But here you can get up to 100x (for each GPU).
Alex, have you really been able to get a 10x improvement with assembler over C? What compiler are you comparing against? In general, hand-optimized IA-32 code should be inferior to the output of a decent compiler.
It was many people’s work, a large project. The task was to implement the stack decoding algorithm efficiently over a huge graph. The task also included neural nets of a peculiar configuration for score estimation. And finally, with all the bells and whistles, it went ~10x faster than the initial “raw” implementation in C/C++. The main trick was going from floating point to integers (partially) and implementing this in a clever way through pieces of assembler. The results were faster and less reliable, but still admissible.
Though it is not a human vs. compiler comparison, it is an indication that with less abstraction, while still knowing the properties of the task at large, you actually have the ability to make it faster even on conventional CPUs.
That’s the truth I am talking about. You can get a 10x speedup factor on the CPU by applying SSE, multithreading (on multicores), optimizing for cache hits, packing large byte-per-boolean arrays into bits, etc.
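As a small host-side illustration of the last point (a hypothetical sketch of my own, nothing from this thread): packing booleans into bits instead of one byte per flag cuts memory traffic by 8x and helps the cache.

```
/* Host-side C: one bit per boolean instead of one byte per boolean. */
#include <stdint.h>

static void setFlag(uint32_t *bits, unsigned int i)
{
    bits[i >> 5] |= 1u << (i & 31);            /* word i/32, bit i%32 */
}

static int getFlag(const uint32_t *bits, unsigned int i)
{
    return (bits[i >> 5] >> (i & 31)) & 1;
}
```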
Usually GPU code runs 10 times faster than the CPU version because CUDA constrains you from the beginning: for instance, you can’t use the “new” operator inside a kernel or do other obviously suboptimal things.
Try investing the same time you spend with CUDA on just optimizing your existing C/C++ code ;-) you might get the same 10x speedup.
But if you target a 100x speedup for a complex algorithm (Monte Carlo was a nice example, but too simple), you will have to go all the way shown in the reduction sample from the SDK.
You can’t really talk of speedup per GPU, because GPUs cannot communicate with each other; you get a speedup proportional to the number of GPUs only if your task can be divided into a set of independent subtasks.
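For completeness, a minimal sketch of the “independent subtasks” case (my own example; it assumes a CUDA runtime where a single host thread can switch devices with cudaSetDevice, and that each chunk was allocated on its own device):

```
#include <cuda_runtime.h>

// Trivial independent subtask: each GPU processes its own chunk, no communication.
__global__ void processChunk(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void runOnAllGpus(float **d_chunks, int chunkSize)  // d_chunks[dev] lives on device 'dev'
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);                           // select GPU 'dev'
        processChunk<<<(chunkSize + 255) / 256, 256>>>(d_chunks[dev], chunkSize);
    }
    for (int dev = 0; dev < deviceCount; ++dev) {     // wait for all GPUs to finish
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
    }
}
```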
Totally agree with you, except that Monte Carlo is too simple…
If you need to compute a good number of paths, with for example 5 random factors, you cannot really work with only a factor of 10: it takes too long! With a factor of 200 per GPU it becomes possible.
The Monte Carlo example given is very simple… but don’t worry, a Monte Carlo can be very complex, and the GPU is totally adapted for that.
So GPUs offer huge possibilities for some areas of scientific computation, and we should not forget another area: video games! :D
Regards.
Just to add my (modest) contribution: using an 8800 GTX, my algorithm is up to 400 times faster with CUDA than a similar C implementation. And I think I can do better.
Well, yes. But heavy threads that fit one particular CPU absolutely won’t fit another. Just imagine what can happen if you optimize for cache hits on a quad-core and then “some stupid user” runs it on a dual-core… Or the other way round: if you write (and optimize) 4 separate threads, what is going to happen on an 8-core CPU?
Core load balancing is VERY tricky for heavy threads. Thus NVIDIA proposes a kind of new approach to multithreading: find the low-level parallelism in your task and create a huge number of independent (at least partially) threads.