I was wondering about the development effort (or at least the perceived development effort) of parallel/heterogeneous programming. That is, say one was developing a function without using higher-level building blocks. If the naive serial implementation requires X hours of development time, how long would it take to develop an optimized version, a parallel version (using, say, OpenMP, MPI, TBB, Cilk++ etc.) and a GPU-accelerated version? Of course I know this depends quite a lot on the kind of function being implemented, the programmer’s proficiency etc. So perhaps I should rather ask how much more effort it takes, given that the programmer is proficient.
Most studies seem to use SLOC as an indication of programming effort, but that doesn’t quite seem right. The above is also not a very exact question, as I have not specified how proficient the programmer is, what proficient even means, or what problem domain they are working in … It is more a general question (out of curiosity) about how much effort people feel it takes to develop these different versions than a scientific study :)
I can’t give a quantitative answer, but certainly for most data-parallel problems, I find writing the GPU-optimized version quicker than writing an optimized CPU version. A mediocre CPU version is pretty easy, and is nice to compare with the GPU, but optimizing for the CPU seems to be much trickier.
Development time is inversely proportional to the ingenuity of the designer and the availability of debug tools, and directly (and hence adversely) proportional to the laziness of the programmer.
I’d agree with seibert’s statements above: I too find optimized GPU versions of the codes I’m dealing with (mostly various types of numerical solvers) much easier to come up with than the CPU versions. As far as the various parallel programming APIs are concerned: here I still find MPI preferable (very mature, very comprehensive), and then CUDA. I find all of these multi-processor programming APIs (like OpenMP, TBB, etc.) rather cumbersome to use (most of the time, I see no advantage of these over plain POSIX threads); but of course, the APIs mentioned target completely different parallel architectures, so it’s rather hard to provide any kind of meaningful comparison beyond personal opinions.
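To make the comparison a bit more concrete, here is a rough sketch (not taken from anyone's actual code) of the same trivially data-parallel loop written three ways: serial, with an OpenMP pragma, and with hand-managed threads. The function names and the SAXPY example are made up for illustration, and `std::thread` stands in for plain POSIX threads here. A loop this simple says nothing about effort on real solvers; it only shows where the extra lines go in each style.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Serial baseline: y[i] = a * x[i] + y[i].
void saxpy_serial(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(y.size() >= x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// OpenMP version: the loop body is unchanged, only a pragma is added.
// Without OpenMP enabled at compile time (e.g. -fopenmp on GCC) the pragma
// is ignored and this runs serially.
void saxpy_openmp(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(y.size() >= x.size());
    const long n = static_cast<long>(x.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Hand-threaded version: the work is split into contiguous chunks by hand,
// one thread is started per chunk, and all threads are joined at the end.
void saxpy_threads(float a, const std::vector<float>& x, std::vector<float>& y,
                   unsigned num_threads) {
    assert(y.size() >= x.size());
    const std::size_t n = x.size();
    if (num_threads == 0) num_threads = 1;
    const std::size_t chunk = (n + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) continue;  // nothing left for this thread
        workers.emplace_back([a, &x, &y, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                y[i] = a * x[i] + y[i];
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

All three produce the same result for the same input; the difference is in how much partitioning and thread bookkeeping the programmer writes by hand.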