I have a set of legacy C codes which perform some data-intensive tasks. I use these codes a lot.
Now, with a CUDA-enabled graphics card at my disposal, and with these tasks taking ever longer as my database keeps growing, I would like to use the power of parallel computing to reduce the running time of my codes.
However, I cannot even think of writing an equivalent set of CUDA codes for all the tasks myself. So I was wondering if there is a script/tool I can use that takes my serial C code as input and produces the parallel CUDA code as output.
I understand that such a script might not be able to provide the speedups of hand-tuned programs, but I am happy with moderate speedups as long as they can be achieved easily.
I searched a bit and found that there exist some directive-based parallelization tools such as PGI Accelerator, CAPS HMPP, Goose, and the NOAA F2C Fortran/C to CUDA C auto-parallelisation compiler. But the drawback of all these tools is that I would have to learn each one's specific compiler directive syntax.
I have not seen such a solution. Converting serial C code to data parallel code is a very hard problem. icc and gcc have auto-vectorizers that can detect loops which can be parallelized and produce SSE instructions for them. CUDA is a much more difficult beast for an auto-vectorizer to efficiently target due to the complex cost tradeoffs of communication over the PCI-Express bus. You basically need some kind of directives to tell the compiler when the tradeoff is worth it, and how to intelligently move the data.
Can the most time intensive part of your program be isolated to one area? Incremental transformation might get you most of the benefit of CUDA.
There is also hiCUDA. And if you are happy with moderate speedups, you might also start with OpenMP for parallelization on the CPU, which has an execution model that is a lot easier to grasp. However, seconding Seibert, I don't think you will get anywhere without at least learning the directive syntax of a parallelizing compiler.
I don't know if such a tool exists or not, but I definitely believe it should be possible to come up with one. Undoubtedly, it would have very high usability.
I know a lot of work has been (and is being) done in the area of automatic detection of parallelisable tasks in a serial C code.
Now with hiCUDA, it is also possible to automatically generate the required CUDA code.
So by using both these steps, a complete end-to-end solution should be possible.
I think it will be easier once the cycle of integration goes around again and it becomes standard for CPUs to be paired up with on-chip GPU-like cores. AMD Fusion, Intel’s Sandy Bridge, and in the future Project Denver do this, and hopefully auto-vectorizers will expand to utilize on-chip GPUs.
That said, my experiments with the current auto-vectorizers have all been disappointing. With the exception of very simple and carefully written loops, they usually fail to do anything. It is much more productive still to identify the key hotspot in my code and re-express it as an explicitly parallel calculation.
My experience with auto-parallelizing compilers is several years old, but to me they seemed basically useless for the same reason.
On the other hand, learning just a few directives really should not be that difficult, and my experience with OpenMP is very positive. If hiCUDA or any of the commercial compilers is able to get anywhere near that, that should be a very useful tool.
Well yes, when I think like you people (programmers), I would probably also prefer 'just' reading a few directives, adding them to the given program, getting a bit of control over the output parallel code, and squeezing out as much speedup as I can.
But when I think like a person with no knowledge of parallel architectures, and no interest in them either, just a curiosity to put his otherwise idle processors to use for a speedup, I believe auto-parallelizers are exactly the right tool.
Having said that, I of course lack the experience that you guys have.
You say that your experience with such tools hasn't been very satisfactory, which means there have been attempts at building them. Where do you think they fell short? Why couldn't they be used? I can't understand why they aren't better known.
Could you please share with me the link to some of these?
As someone whose job is not professional programming, I and my colleagues would be much better served by languages that work at a higher level and make the parallelism explicit. When the problem domain permits it, using libraries like numpy and PyCUDA both make my code simple to read and make it easier for the code to be parallelized. I have colleagues who work almost exclusively in Matlab, which also can be easily transformed into GPU code using tools like AccelerEyes Jacket. C and C++ tend to encourage a low-level perspective that obscures the data relations and the properties of functions.
However, given the substantial size of existing C++ code I have to work with, I would welcome some automated parallelization tools. I’m just not going to hold my breath waiting for them. :)
The challenge is in trying to infer the presence or absence of side-effects and dependencies between iterations of a loop. In the case where you have simple operations, this is not too hard:
for (int i=0; i < n; i++)
c[i] = a[i] + b[i];
But suppose you do something like this:
for (int i=0; i < n; i++)
c[i] = foo(a[i]) + foo(b[i]);
Looking just at this section, the compiler doesn't know whether foo has side-effects that would make parallelization produce incorrect results. If the implementation of foo is available and inline-able, then the compiler can decide how to proceed. If the function is in a different compilation unit, then the compiler can't do anything until the linker runs, and then only if the object files contain sufficient annotations to allow post-compilation inlining. I believe the Intel compiler can do this, which is one of its performance edges over gcc. (Although gcc may also be developing this capability.)
However, once you start throwing in things like C++ and virtual functions, this quickly becomes a terribly hard problem to get right. I'm sure experts in this area know of all sorts of other tricky corner cases.
Even worse, sometimes the ability to parallelize depends on the algorithm chosen by the programmer. Some algorithms just can’t be parallelized, but by changing the approach (and sometimes the data structure) you can fix that.
An example is this implementation of a prefix sum:
prefix_sum[0] = a[0];
for (int i=1; i < n; i++)
prefix_sum[i] = a[i] + prefix_sum[i-1];
Every iteration of this loop depends on the previous one. Nevertheless, you can do this efficiently in parallel by using the associativity of addition and a pairwise summing scheme. (See Prefix sum - Wikipedia for details.) I don’t expect any compiler to be able to figure out tricks like this in general. There is just a lot of code out there you cannot run in parallel because no one thought about it that way at the time.
I strongly suspect that your boss considers you to be an excellent automatic parallelisation tool. :)
Getting back towards the original question, I believe that some autoparallelisers work by inserting OpenMP directives into the code. If a routine isn’t getting a good speed up, you can have a look at the OpenMP stuff, and work out how it’s screwing up (usually as you say, some loop-carried dependency). I understand that the PGI CUDA compiler is supposed to work in a similar way - giving you feedback on why parallelisation failed.
But in general, you’re right. Much better languages are needed for scientific computing, since C/C++, Fortran and the like are too low level. As a speaker at SC10 said “The moment you let a scientist write ‘DO i=1,10’, you’re dead.”