 # Polite ask for help with paralelling program using CUDA. Task: Rewrite given program so it will use

WARNING

LARGE AMOUNT OF TEXT

Here is the problem:

I have written a program to simulate behavior of a certain framework in certain situation.

http://img17.imageshack.us/g/clipcp.jpg/

This is how it looks like before and after calculations. (this was like 4hours of counting for my pc).

Printing is done using OpenGL and this is not a part of problem.

While writing program i didnt know about CUDA technology and now i would like to use it to make it faster.

Problem is- I have a really hard time understanding whole thing and am searching for ANY help- tips, code written,

simply anything.

Informations:

Framework looks like in the picture( one in beginning of simulation, second after some time).

I am putting force on some nodes in the top and and is standing on something with bottom middle part, so these are

not calculated.

Framework looks like this:

``````class framework

{

public:

vector<vector<nod>> v;	//2-dimensional vector of nodes

float k; //elasticity coefficient, needed for calculations

float l; //standard distance between nodes, needed for calculations

kratownica(void){};

~kratownica(void){};

kratownica(vector<vector<nod>> vv, float kk, float ll)

{

v = vv;

k = kk;

l = ll;

}

};
``````

Some math:

Each node is connected with some other nodes. Each of those nodes is trying to put calculated node in some distance from him.

In math words: force to neighbouring nodes is calculated as f = k(l - r), where k and l are from framework

and r is actual distance between nodes.

Considering this- every node is place in vector field implied by neighbouring nodes and in every position of this field

some force works on this node. So there is almost always possibility to find a better place, which means place with

weaker force.

Add some force from top in some top nodes and this is everything program does. I mean, finding better places for nodes

using newtons method.

Now for a code:

``````void move(framework &k, float f) //f is amount of force

{

for(int i = 0; i < k.v.size(); i++)

{

for(int j = 0; j < k.v.size(); j++)

{

if(i == 0 && j >= (k.v.size() +1)/3 && j <= (2 * k.v.size() + 1) / 3 - 1)	//points with force added

step(k, i, j, f);

else if(i == k.v.size()-1 && j >= (k.v.size() +1)/3 && j <= (2 * k.v.size() + 1) / 3 - 1) //points not calculated

true;

else

step(k, i, j, 0);	//rest

}

}

}
``````

This is basically the main part that needs to get paralelled. I believe, as it is just 2 fors, it should be possible to do quite well.

What it does is of course finding better places for every node.

Now for function step, here comes the heavy math:

``````float step(framework &k, int i, int j, float f)	//moving node at coordinates i, j in framework k

{

vector<float> r;	//vector of distances between given node and neighbours (or, to be exact, reverse of it)

vector<float> dx;	//vector of differences by x axis

vector<float> dy;	//vector of differences by y axis, all 3 needed for calculations

float l = k.l;	//standard distance between nodes

float kk = k.k;	//elasticity coefficient

nod aa = k.v[i][j]; // node we work at,

vector<nod> v1 = addnodes1(k, i, j);

vector<nod> v2 = addnodes2(k, i, j); //these are 2 vectors. One holds nodes neighbouring by straight lines (at top, at bottom, at left, at right),

//second one holds neighbours at top-left, top-right etc.

//they are separated becouse the second ones are at sqrt(2)*l standard distance, which must be noted in calculations.

for(int i = 0; i < v1.size(); i++)	//filling r, dx, dy with informations

{

r.push_back(Q_rsqrt((aa.x - v1[i].x)*(aa.x - v1[i].x) + (aa.y - v1[i].y)*(aa.y - v1[i].y)));

dx.push_back(aa.x - v1[i].x);

dy.push_back(aa.y - v1[i].y);

}

for(int i = 0; i < v2.size(); i++)

{

//r.push_back( odleglosc(aa, v2[i]));

r.push_back(Q_rsqrt((aa.x - v2[i].x)*(aa.x - v2[i].x) + (aa.y - v2[i].y)*(aa.y - v2[i].y)));

dx.push_back(aa.x - v2[i].x);

dy.push_back(aa.y - v2[i].y);

}

//BEGINNING of counting force in place nod is now.

float Fx = 0;

float Fy = 0;

for(int i = 0; i < v1.size(); i++)

{

Fx += (l * r[i] - 1) * dx[i];

Fy += (l * r[i] - 1) * dy[i];

}

for(int i = v1.size(); i < v1.size() + v2.size(); i++)

{

Fx += (1.4142 * l * r[i] - 1) * dx[i];

Fy += (1.4142 * l * r[i] - 1) * dy[i];

}

Fx *= -kk;

Fy *= -kk;

Fy += f;	//adding force from outside

// END of calculating force in place nod is now

float norm = abs(Fx) + abs(Fy);		//norm of this place

if(norm < 0.00001)	//if node is in place good enough, no need for further calculations

return norm;

float a = 0;

float b = 0;

float c = 0;

/*

a, b and c is a matrix:

a b

b c

This matrix is a Jacobian of force in point nod is now- needed to use Newtons method

*/

for(int i = 0; i < v1.size(); i++)	//calculations of jacobian.

{

a += 1 - l * r[i] + l * dx[i] * dx[i] * (r[i] * r[i] * r[i]);

b += l * dx[i] * dy[i] * (r[i] * r[i] * r[i]);

c += 1 - l * r[i] + l * dy[i] * dy[i] * (r[i] * r[i] * r[i]);

}

for(int i = v1.size(); i < v1.size() + v2.size() ; i++)

{

a += 1 - 1.4142 * l * r[i] + 1.4142 * l * dx[i] * dx[i] * (r[i] * r[i] * r[i]);

b += 1.4142 * l * dx[i] * dy[i] * (r[i] * r[i] * r[i]);

c += 1 - 1.4142 * l * r[i] + 1.4142 * l * dy[i] * dy[i] * (r[i] * r[i] * r[i]);

}

a *= kk;

b *= kk;

c *= kk;

float det = a*c - b*b;

float A1 = c/det;

float A2 = -b/det;

float A3 = a/det;

// A1, A2 i A3 is reversed JAcobian

k.v[i][j].x -= A1*Fx + A2*Fy;

k.v[i][j].y -= A2*Fx + A3*Fy;	// putting node to better place using newtons method.

return norm;	//returning old norm.

}
``````

I guess this is everything needed.

Task here is to make calculations in “move” function parallel using CUDA and i am nicely asking for ANY help.

Code itself probably needs to be reorganised for this, but i dont really know how should i do it.

If anyone got throught this all, thanks in advance.

Also, sorry for possible bad english, its not my native.

Also, feel free to ask questions.

Looks like a perfect application for CUDA. Start by reading the CUDA C Programming Guide. It has all the information you need in an accessible way.

Your [font=“Courier New”]move()[/font] function will become a kernel, with the loops replaced by using the x- and y-coordinates of the thread in the grid. [font=“Courier New”]step()[/font] becomes a [font=“Courier New”]device[/font] function, and after adding some code to copy data to the GPU and back and turning your call to [font=“Courier New”]move()[/font] into a kernel invocation you are done.

I already read it. Problem is: Compiler is crying about “step” being host function in global kernel. And I also have no clue where and when should I put which data.

Remember to either declare step() or at least have a forward reference before it’s use, otherwise the compiler will assume it’s a host function.

Your framework [font=“Courier New”]k[/font] apparently is the data that you need to allocate space for on the device, copy from the host to the device and back.

BTW: Code posted in the forum appears much more readable if enclosed within [code]…[/code] tags. Would be nice if you could edit them into your original post.

Could you be more specific about what should I do with step() function?

Just move it so that it appears before [font=“Courier New”]move()[/font] in the file. This way, by the time the compiler encounters the call to [font=“Courier New”]step()[/font] it will already know it is a [font=“Courier New”]device[/font] function, not a function on the host.

Another problem came up. I am using a lot of vector operators. Whole framework is 2 dimensional vector, while calculating movement of nodes i use some vectors.

In every vector operation ( like .size(), .push_back() ) compiler reports error- cannot use host function in device kernel.

Do I have to rethink whole algorithm to not use vectors or is it just some easy removable error? Rewriting the code is generally possible, but i would prefer some more elegant solution.

Most of stdlib is not (yet?) implemented on device side, so unfortunately for now you will have to write your own code for that.

As CUDA is all about maximum performance (otherwise just run your code on the CPU…), you might soon find out that you have to think about what your code is doing at a lower level anyway.

Challange accepted. I’ll post when it’s done, but this can now take some longer time. Thanks for help.

So. I kinda remastered the hearth of whole simulation and now i have this function:

``````__device__ void movenod(framework &k, int i, int j, float f)

{

float r;

float dx;

float dy;

float l = k.l;

float kk = k.k;

int size = k.rozmiar - 1;

int colbegin;

int colend;

int rawbegin;

int rawend;

/*

Some instructions about initiating colbegin, colend etc.

*/

nod aa = k.v[i][j];

for(int m = rawbegin; m <= rawend; m++)

for(int n = colbegin; n <= colend; n++)

{

if(!(m == 1 && n == 1))

if(k.v[i+m-1][j+n-1].exist == 1)

{

dx[m][n] = (aa.x - k.v[i+m-1][j+n-1].x);

dy[m][n] = aa.y - k.v[i+m-1][j+n-1].y;

Q_rsqrt((dx[m][n])*(dx[m][n]) + (dy[m][n])*(dy[m][n]), r[m][n]);

}

}

float Fx = 0;

float Fy = 0;

if(!(i == size && j >= (size+2)/3 && j <= (2 * size + 3) / 3 - 1))

{

for(int m = rawbegin; m <= rawend; m++)

for(int n = colbegin; n <= colend; n++)

{

if(!(m == 1 && n == 1))

if(k.v[i+m-1][j+n-1].exist == 1)

{

float rr = sqrt((float)(abs(m-1) + abs(n-1))) * l * r[m][n];

Fx += (rr - 1) * dx[m][n];

Fy += (rr - 1) * dy[m][n];

}

}

Fx *= -kk;

Fy *= -kk;

if(i == 0 && j >= (k.rozmiar +1)/3 && j <= (2 * k.rozmiar + 1) / 3 - 1)

Fy += f;

}

float norm = abs(Fx) + abs(Fy);

if(norm > 0.00001)

{

float a = 0;

float b = 0;

float c = 0;

for(int m = rawbegin; m <= rawend; m++)

for(int n = colbegin; n <= colend; n++)

if(!(m == 1 && n == 1))

if(k.v[i+m-1][j+n-1].exist == 1)

{

a += 1 - sqrt((float)(abs(m-1) + abs(n-1))) * l * r[m][n] * ( 1 + dx[m][n] * dx[m][n] * r[m][n] * r[m][n]);

b += sqrt((float)(abs(m-1) + abs(n-1))) * l * dx[m][n] * dy[m][n] * (r[m][n] * r[m][n] * r[m][n]);

c += 1 - sqrt((float)(abs(m-1) + abs(n-1))) * l * r[m][n] * ( 1 + dy[m][n] * dy[m][n] * r[m][n] * r[m][n]);

}

a *= -kk;

b *= -kk;

c *= -kk;

float det = a*c - b*b;

float A1 = c/det;

float A2 = -b/det;

float A3 = a/det;

k.v[i][j].x -= A1*Fx + A2*Fy;

k.v[i][j].y -= A2*Fx + A3*Fy;

}

}
``````

And the thing is- the code works perfectly on a normal compiling. On Cuda it gives me errors:

nod aa = k.v[i][j]; - > operand of “*” must be a pointer

and “expression must have pointer type” everywhere i use framework k. (for example if(k.v[i+m-1][j+n-1].exist == 1)).

First question-> why is that?

Second question - > How should i do this?

I’ve read some other topics, bot I’m not sure if I got it right- is the problem connected with 2 dimentional vector?

Btw, are you sure you need

float norm = abs(Fx) + abs(Fy);

not

float norm = fabs(Fx) + fabs(Fy);

?

Concerning your question, you need allocate memory in Cuda by your self and select proper functions to allocate memory on gpu.

I’m not sure if Nvidia has come around implementing [font=“Courier New”]vector[/font] for device code yet. I certainly haven’t used it so far.

Allrighty, so vectors at all. I guess thread can be closed, im going to implement other, other, faster method now anyway. Thank you all for advices: )

By the way, method im going to implement is all about solving single matrix equation, so I’m still using cuda for it: ).