# Efficiency vs Maintainability parallel program

Hi I have a little doubt about what’s better, thinking about efficiency and maintainability of my code, what’s more important is the efficiency but I want to sacrifice the less in code maintainability.

I’m programming the distance transform using Euclidean distance, Cityblock distance, Chessboard distance.

I have the following code

``````for (idxI = 0; idxI < matrix->row; idxI++)

{

for (idxJ = 0; idxJ < matrix->col; idxJ++)

{

if ( distTransform->data[idxI*matrix->col+idxJ] == 0 )

{

for (idxK = 0; idxK < matrix->row; idxK++)

{

for (idxL = 0; idxL < matrix->col; idxL++)

{

switch (type) {

case CITYBLOCK:

distance = abs(idxK-idxI) + abs(idxL-idxJ);

break;

case CHESSBOARD:

distance = fmax(abs(idxK-idxI), abs(idxL-idxJ));

break;

case EUCLIDEAN:

distance = sqrt(pow(idxK-idxI, 2) + pow(idxL-idxJ,2));

break;

}

if ( distTransform->data[idxK*matrix->col+idxL] > distance )

distTransform->data[idxK*matrix->col+idxL] = distance;

}

}

}

}

}
``````

My problem is with the switch structure will it be less efficient? because if I don’t want to use the switch structure I will me duplicating code like a beast ! This is my secuencial code but after this little dilemma I will be programming the parallel version so first I need to solve this …

Any ideas?

Thanks !

If you have a Fermi card, that looks like a perfect application for function pointers.

2 things:

1. If don’t have a Fermi card? I have a GeForce GTX 460 what can I do to improve and avoid code duplication?

2. From a point of view of sequential algorithm that thing is efficient? I mean, if I use the swtich structure is a good practice?

The GTX 460 is a Fermi card, so you are on the safe side. On other cards, you might also use template programming to achieve the same.

By far the best optimization however would be to analytically identify the corner of the grid that is furthest away and eliminate the loops over [font=“Courier New”]IdxI[/font] and [font=“Courier New”]IdxJ[/font].

You have a Fermi card, then, so using function pointers (or templating as tera suggested).

No, it is terrible. Why would you ever evaluate a conditional statement O(N^4) times when it could be evaluated once?

A solution using template would likely be much faster than one using function pointers. The compiler would inline the code and would have a chance at unrolling some of your loops.

Thanks !

I will ask you, If you could explain me how to change my code I don’t have any idea of templates/function pointers :S I cannot see how can I avoid the switch structure without code duplication :S

Thanks again.

The nice thing about templates is that all the code duplication is done by the compiler for you.

The simple way to change your function to templates is just to make type a template parameter and leave the switch as is.

``````template<unsigned int type> __global__ void kernel(......)

{

....

switch (type) {

case CITYBLOCK:

distance = abs(idxK-idxI) + abs(idxL-idxJ);

break;

case CHESSBOARD:

distance = fmax(abs(idxK-idxI), abs(idxL-idxJ));

break;

case EUCLIDEAN:

distance = sqrt(pow(idxK-idxI, 2) + pow(idxL-idxJ,2));

break;

....

}
``````

The dead code optimizer will take care of removing the unused distance calculations.

A cleaner way to code it is to use functors for the distance calc. For that you write a class with a () operator that performs the comupuation and pass than it as a template param "template " … I think I remember seeing an example of this in the CUDA programming guide section on templates.

P.S. pow(value,2) is exceedingly slow compared to (value*value).

Ok I see, I checked the Programming Guide and of course there is something about Templates and Functors in CUDA.

Now, I think I should have said that in my sequential code I’m using C (not C++) so I think I cannot use templates, can I?Are templates specifically for C++?

Is there a way to do this (avoid of code duplication and optimizing the switch strucure) in a cleaner and efficient way in C (for sequential and parallel algorithm?

Thanks again !

Uff I see clearly now…now the question should be how can I (in the sequential algorithm) optimize without code duplication?

I mean I don’t want to have 4 for-loops for CITYBLOCK type, another 4 for-loops for CHESSBOARD and another 4 for-loops for EUCLIDEAN…That could be a way to avoid the evaluation of the switch O(N^4) but a LOT of code duplication…can you find a way nicier/cleaner/better

Cheers