Automatic Parallelization of C code

fundoonick · March 14, 2011, 8:29pm

Hello everyone,

I have a set of legacy C codes which perform some data-intensive tasks. I use these codes a lot.

Now with CUDA-enabled Graphics card at my disposal and the increasing time these tasks take due to my ever increasing database, I would like to use the power of parallel computing for reducing the running time of my codes.
However, I cannot even think of writing an equivalent set of CUDA codes for all the tasks, myself. So, I was wondering if there is a script/tool I can use which takes as input my serial C code and produces the parallel CUDA code as output.
I understand that such a script might not be able to provide me the speedups as compared to hand-pruned programs, but I am happy with moderate speedups as long as they can be achieved easily.

I searched a bit and found that there exist some directive based parallelization tools such as PGI Accelerator, CAPS HMPP, Goose and NOAA F2C Fortran/C to CUDA C auto-parallelisation compiler. But the drawback in all these tools is that I will have to understand the specific compiler directive syntax.

Is there a simple solution available?

seibert · March 14, 2011, 10:02pm

I have not seen such a solution. Converting serial C code to data parallel code is a very hard problem. icc and gcc have auto-vectorizers that can detect loops which can be parallelized and produce SSE instructions for them. CUDA is a much more difficult beast for an auto-vectorizer to efficiently target due to the complex cost tradeoffs of communication over the PCI-Express bus. You basically need some kind of directives to tell the compiler when the tradeoff is worth it, and how to intelligently move the data.

Can the most time intensive part of your program be isolated to one area? Incremental transformation might get you most of the benefit of CUDA.

tera · March 14, 2011, 10:40pm

There also is hiCUDA. And if you are happy with moderate speedups, you might also start with OpenMP for parallelization on the CPU, which has an execution model that is a lot easier to grasp. However, seconding Seibert, I don’t think you will get anywhere without at least learning the directive syntax of a parallelizing compiler.

akhil89 · March 15, 2011, 12:52pm

I dont know if such a tool exists or not, but I definitely believe that it should be possible to come up with such a tool. Undoubtedly, it would have very high usability.

I know a lot of work has been (and is being) done in the area of automatic detection of parallelisable tasks in a serial C code.
Now with hiCUDA, it is also possible to automatically generate the required CUDA code.

So by using both these steps, a complete end-to-end solution should be possible.

seibert · March 15, 2011, 2:30pm

I think it will be easier once the cycle of integration goes around again and it becomes standard for CPUs to be paired up with on-chip GPU-like cores. AMD Fusion, Intel’s Sandy Bridge, and in the future Project Denver do this, and hopefully auto-vectorizers will expand to utilize on-chip GPUs.

That said, my experiments with the current auto-vectorizers have all been disappointing. With the exception of very simple and carefully written loops, they usually fail to do anything. It is much more productive still to identify the key hotspot in my code and re-express it as an explicitly parallel calculation.

tera · March 15, 2011, 5:15pm

My experience with auto-parallelizing compilers is several years old, but to me they seemed basically useless for the same reason.

On the other hand, learning just a few directives really should not be that difficult, and my experience with OpenMP is very positive. If hiCUDA or any of the commercial compilers is able to get anywhere near that, that should be a very useful tool.

fundoonick · March 16, 2011, 7:48pm

Well yes, when I think like you people(Programmers), probably even I will prefer ‘just’ reading a few directives and add them to the given program and get a bit of control over the output parallel code and get as much speedup as I can.

But, when I think like a person, with no knowledge about parallel architectures and even no interest in it, but just a curiosity to be able to utilize his otherwise idle processors to gain speedup, I believe auto-parallelizers are just the right tool for me.
Having said that, I of course lack the experience that you guys have.

You say that your experience with such tools hasnt been very satisfactory, which means there have been attempts at building such tools. What do you think they lagged? Why they couldnt be used? I cant understand why they arent so famous?

Could you please share with me the link to some of these?

tera · March 16, 2011, 9:38pm

Of course it would be more desirable if programs wrote themselves and programmers weren’t necessary. Problem just is, current technology isn’t there yet.

One of these parallelizing compilers that you can even try for free is Intel’s compiler suite.

seibert · March 17, 2011, 3:34am

As someone whose job is not professional programming, I and my colleagues would be much better served by languages that work at a higher level and make the parallelism explicit. When the problem domain permits it, using libraries like numpy and PyCUDA both make my code simple to read and make it easier for the code to be parallelized. I have colleagues who work almost exclusively in Matlab, which also can be easily transformed into GPU code using tools like AccelerEyes Jacket. C and C++ tend to encourage a low-level perspective that obscures the data relations and the properties of functions.

However, given the substantial size of existing C++ code I have to work with, I would welcome some automated parallelization tools. I’m just not going to hold my breath waiting for them. :)

The challenge is in trying to infer the presence or absence of side-effects and dependencies between iterations of a loop. In the case where you have simple operations, this is not too hard:

for (int i=0; i < n; i++)

{

  c[i] = a[i] + b[i];

}

But suppose you do something like this:

for (int i=0; i < n; i++)

{

  c[i] = foo(a[i]) + foo(b[i]);

}

Looking just at this section, the compiler doesn’t know if foo has side-effects that would make parallelization produce incorrect results. If the implementation of foo is available and inline-able, then the compiler can decide how to procede. If the function is in a different compilation unit, then the compiler can’t do anything until the linker runs, and then only if the object files contain sufficient annotations to allow post-compilation inlining. I believe the Intel compiler can do this, which is one of its performance edges over gcc. (Although gcc may also be developing this capability.)

However, once you start throwing in things like C++ and virtual functions, this quickly becomes a terrible problem to do correctly. I’m sure experts in this know of all sorts of other tricky corner cases.

Even worse, sometimes the ability to parallelize depends on the algorithm chosen by the programmer. Some algorithms just can’t be parallelized, but by changing the approach (and sometimes the data structure) you can fix that.

An example is this implementation of a prefix sum:

prefix_sum[0] = a[0];

for (int i=1; i < n; i++)

{

  prefix_sum[i] = a[i] + prefix_sum[i-1];

}

Every iteration of this loop depends on the previous one. Nevertheless, you can do this efficiently in parallel by using the associativity of addition and a pairwise summing scheme. (See Prefix sum - Wikipedia for details.) I don’t expect any compiler to be able to figure out tricks like this in general. There is just a lot of code out there you cannot run in parallel because no one thought about it that way at the time.

The gcc autovectorizer is described on this page:

http://gcc.gnu.org/projects/tree-ssa/vectorization.html

YDD · March 17, 2011, 1:33pm

I strongly suspect that your boss considers you to be an excellent automatic parallelisation tool External Image

Getting back towards the original question, I believe that some autoparallelisers work by inserting OpenMP directives into the code. If a routine isn’t getting a good speed up, you can have a look at the OpenMP stuff, and work out how it’s screwing up (usually as you say, some loop-carried dependency). I understand that the PGI CUDA compiler is supposed to work in a similar way - giving you feedback on why parallelisation failed.

But in general, you’re right. Much better languages are needed for scientific computing, since C/C++, Fortran and the like are too low level. As a speaker at SC10 said “The moment you let a scientist write ‘DO i=1,10’, you’re dead.”

Sarnath · March 19, 2011, 3:11am

Good one…This one made my morning today. Worth reading the thread till the end… (No offence Seibert… Just that I liked that joke.)

ajo · March 30, 2011, 4:08pm

I used an interactive parallelization tool by Vector Fabrics that does exactly that: it takes my sequential C program and shows my

opportunities to parallelize. The side effects Seibert refers to in his post are detected and displayed on the screen.

It works by actually profiling your program as it runs. You can play with parallelization scenarios, play with the number of threads,

see the cache behavior and memory bandwidth loading, see run times change when you move code sections from one CPU type to another, etc.

Although they do not currently seem to target GPUs directly (they target POSIX parallel threads), their tools are still very useful to

find out how to transform sequential programs to make them run on parallel hardware.

There are some slides at http://flexware.elis.ugent.be/system/files/gent-symposium-stravers.pdf

Just my 2 cents… ! External Image

seibert · March 30, 2011, 4:51pm

Neat! Looks like a slick program. I’ll have to try it out. :)

Sarnath · April 1, 2011, 9:41am

Man… This is something I have always wanted (to use or develop)… Cool! Thanks for the info…

Topic		Replies	Views
Change source code? How exactly... CUDA Programming and Performance	6	2559	October 3, 2008
auto converter Serial C Code to Parallel GPU Code CUDA Programming and Performance	1	855	May 11, 2016
Building a library for automatic parallel computing with CUDA CUDA Programming and Performance	5	856	January 23, 2011
Translating FORTRAN to C++ to CUDA advice CUDA Programming and Performance	19	23481	February 1, 2010
Can PGI compiler automatically extract paralellism? Legacy PGI Compilers	2	2279	July 4, 2012
Suggestion for CUDA CUDA Programming and Performance	7	4369	April 29, 2008
What do I change CUDA Programming and Performance	0	968	December 7, 2008
how to convert c to cuda? source include function CUDA Programming and Performance	2	1724	November 2, 2013
When to use Serial CPU, CUDA, OpenMP and MPI? CUDA Programming and Performance	8	13975	May 29, 2021
Convincing skeptical bigwigs on the future of CUDA CUDA Programming and Performance	49	9200	March 19, 2009

Automatic Parallelization of C code

Related topics