Efficiency vs Maintainability parallel program

BRabbit27 · April 6, 2011, 8:03pm

Hi I have a little doubt about what’s better, thinking about efficiency and maintainability of my code, what’s more important is the efficiency but I want to sacrifice the less in code maintainability.

I’m programming the distance transform using Euclidean distance, Cityblock distance, Chessboard distance.

I have the following code

for (idxI = 0; idxI < matrix->row; idxI++)

	{

		for (idxJ = 0; idxJ < matrix->col; idxJ++)

		{

			if ( distTransform->data[idxI*matrix->col+idxJ] == 0 )

			{

				for (idxK = 0; idxK < matrix->row; idxK++)

				{

					for (idxL = 0; idxL < matrix->col; idxL++)

					{

						switch (type) {

							case CITYBLOCK:

								distance = abs(idxK-idxI) + abs(idxL-idxJ);

								break;

							case CHESSBOARD:

								distance = fmax(abs(idxK-idxI), abs(idxL-idxJ));

								break;

							case EUCLIDEAN:

								distance = sqrt(pow(idxK-idxI, 2) + pow(idxL-idxJ,2));

								break;

						}

						if ( distTransform->data[idxK*matrix->col+idxL] > distance )

							distTransform->data[idxK*matrix->col+idxL] = distance;

					}

				}

			}

		}

	}

My problem is with the switch structure will it be less efficient? because if I don’t want to use the switch structure I will me duplicating code like a beast ! This is my secuencial code but after this little dilemma I will be programming the parallel version so first I need to solve this …

Any ideas?

Thanks !

avidday · April 6, 2011, 8:08pm

If you have a Fermi card, that looks like a perfect application for function pointers.

BRabbit27 · April 6, 2011, 10:15pm

Thanks for the answer !

2 things:

If don’t have a Fermi card? I have a GeForce GTX 460 what can I do to improve and avoid code duplication?
From a point of view of sequential algorithm that thing is efficient? I mean, if I use the swtich structure is a good practice?

tera · April 6, 2011, 11:14pm

The GTX 460 is a Fermi card, so you are on the safe side. On other cards, you might also use template programming to achieve the same.

By far the best optimization however would be to analytically identify the corner of the grid that is furthest away and eliminate the loops over [font=“Courier New”]IdxI[/font] and [font=“Courier New”]IdxJ[/font].

avidday · April 7, 2011, 7:07am

You have a Fermi card, then, so using function pointers (or templating as tera suggested).

No, it is terrible. Why would you ever evaluate a conditional statement O(N^4) times when it could be evaluated once?

DrAnderson42 · April 7, 2011, 9:17am

A solution using template would likely be much faster than one using function pointers. The compiler would inline the code and would have a chance at unrolling some of your loops.

BRabbit27 · April 7, 2011, 12:09pm

Thanks !

I will ask you, If you could explain me how to change my code I don’t have any idea of templates/function pointers :S I cannot see how can I avoid the switch structure without code duplication :S

Thanks again.

DrAnderson42 · April 7, 2011, 12:15pm

The nice thing about templates is that all the code duplication is done by the compiler for you.

The simple way to change your function to templates is just to make type a template parameter and leave the switch as is.

template<unsigned int type> __global__ void kernel(......)

    {

    ....

                                                    switch (type) {

                                                        case CITYBLOCK:

                                                                distance = abs(idxK-idxI) + abs(idxL-idxJ);

                                                                break;

                                                        case CHESSBOARD:

                                                                distance = fmax(abs(idxK-idxI), abs(idxL-idxJ));

                                                                break;

                                                        case EUCLIDEAN:

                                                                distance = sqrt(pow(idxK-idxI, 2) + pow(idxL-idxJ,2));

                                                                break;

    ....

    }

The dead code optimizer will take care of removing the unused distance calculations.

A cleaner way to code it is to use functors for the distance calc. For that you write a class with a () operator that performs the comupuation and pass than it as a template param “template ” … I think I remember seeing an example of this in the CUDA programming guide section on templates.

P.S. pow(value,2) is exceedingly slow compared to (value*value).

BRabbit27 · April 7, 2011, 2:41pm

The nice thing about templates is that all the code duplication is done by the compiler for you.

The simple way to change your function to templates is just to make type a template parameter and leave the switch as is.
template<unsigned int type> __global__ void kernel(......)

    {

    ....

                                                    switch (type) {

                                                        case CITYBLOCK:

                                                                distance = abs(idxK-idxI) + abs(idxL-idxJ);

                                                                break;

                                                        case CHESSBOARD:

                                                                distance = fmax(abs(idxK-idxI), abs(idxL-idxJ));

                                                                break;

                                                        case EUCLIDEAN:

                                                                distance = sqrt(pow(idxK-idxI, 2) + pow(idxL-idxJ,2));

                                                                break;

    ....

    }
The dead code optimizer will take care of removing the unused distance calculations.

A cleaner way to code it is to use functors for the distance calc. For that you write a class with a () operator that performs the comupuation and pass than it as a template param “template ” … I think I remember seeing an example of this in the CUDA programming guide section on templates.

P.S. pow(value,2) is exceedingly slow compared to (value*value).

Ok I see, I checked the Programming Guide and of course there is something about Templates and Functors in CUDA.

Now, I think I should have said that in my sequential code I’m using C (not C++) so I think I cannot use templates, can I?Are templates specifically for C++?

Is there a way to do this (avoid of code duplication and optimizing the switch strucure) in a cleaner and efficient way in C (for sequential and parallel algorithm?

Thanks again !

BRabbit27 · April 7, 2011, 2:51pm

Uff I see clearly now…now the question should be how can I (in the sequential algorithm) optimize without code duplication?

I mean I don’t want to have 4 for-loops for CITYBLOCK type, another 4 for-loops for CHESSBOARD and another 4 for-loops for EUCLIDEAN…That could be a way to avoid the evaluation of the switch O(N^4) but a LOT of code duplication…can you find a way nicier/cleaner/better

Cheers