A polite request for help parallelizing a program using CUDA. Task: rewrite the given program so it will use CUDA.

WARNING

LARGE AMOUNT OF TEXT

Here is the problem:

I have written a program to simulate the behavior of a certain framework in a certain situation.

This is how it looks before and after the calculations (this was about 4 hours of computing on my PC).

Rendering is done using OpenGL and is not part of the problem.

While writing the program I didn't know about CUDA, and now I would like to use it to make the program faster.

The problem is: I have a really hard time understanding the whole thing, and I am searching for ANY help: tips, written code, simply anything.

Information:

The framework looks like in the picture (one at the beginning of the simulation, the second after some time).

I am putting force on some nodes at the top, and it stands on something with its bottom middle part, so those nodes are not calculated.

The framework class looks like this:

class framework
{
public:
	vector<vector<nod>> v;	// 2-dimensional vector of nodes
	float k;	// elasticity coefficient, needed for calculations
	float l;	// standard distance between nodes, needed for calculations

	framework(void) {};
	~framework(void) {};
	framework(vector<vector<nod>> vv, float kk, float ll)
	{
		v = vv;
		k = kk;
		l = ll;
	}
};

Some math:

Each node is connected to some other nodes, and each of those tries to keep the calculated node at some distance from itself. In mathematical terms: the force toward a neighbouring node is calculated as f = k(l - r), where k and l come from the framework and r is the actual distance between the nodes.

Considering this, every node sits in a vector field implied by its neighbouring nodes, and at every position in this field some force acts on the node. So there is almost always a possibility to find a better place, meaning a place with a weaker net force.

Add some external force to a few of the top nodes, and this is everything the program does: finding better places for the nodes using Newton's method.

Now for the code:

void move(framework &k, float f)	// f is the amount of external force
{
	for(int i = 0; i < k.v.size(); i++)
	{
		for(int j = 0; j < k.v.size(); j++)	// the grid is square
		{
			if(i == 0 && j >= (k.v.size() + 1)/3 && j <= (2 * k.v.size() + 1) / 3 - 1)
				step(k, i, j, f);	// points with force added
			else if(i == k.v.size()-1 && j >= (k.v.size() + 1)/3 && j <= (2 * k.v.size() + 1) / 3 - 1)
				continue;	// supported points, not calculated
			else
				step(k, i, j, 0);	// the rest
		}
	}
}

This is basically the main part that needs to be parallelized. I believe, as it is just two nested loops, it should parallelize quite well. What it does is, of course, find a better place for every node.

Now for the step function; here comes the heavy math:

float step(framework &k, int i, int j, float f)	// moves the node at coordinates i, j in framework k
{
vector<float> r;	// vector of distances between the given node and its neighbours (or, to be exact, their reciprocals)
vector<float> dx;	// vector of differences along the x axis
vector<float> dy;	// vector of differences along the y axis; all 3 needed for calculations
float l = k.l;	// standard distance between nodes
float kk = k.k;	// elasticity coefficient
nod aa = k.v[i][j];	// the node we work on
vector<nod> v1 = addnodes1(k, i, j);
vector<nod> v2 = addnodes2(k, i, j);	// two vectors of neighbours: the first holds neighbours along straight lines (top, bottom, left, right),
					// the second holds diagonal neighbours (top-left, top-right, etc.).
					// They are separated because the diagonal ones have a sqrt(2)*l standard distance, which must be noted in the calculations.

for(int i = 0; i < v1.size(); i++)	// filling r, dx, dy with information
{
	r.push_back(Q_rsqrt((aa.x - v1[i].x)*(aa.x - v1[i].x) + (aa.y - v1[i].y)*(aa.y - v1[i].y)));
	dx.push_back(aa.x - v1[i].x);
	dy.push_back(aa.y - v1[i].y);
}

for(int i = 0; i < v2.size(); i++)
{
	r.push_back(Q_rsqrt((aa.x - v2[i].x)*(aa.x - v2[i].x) + (aa.y - v2[i].y)*(aa.y - v2[i].y)));
	dx.push_back(aa.x - v2[i].x);
	dy.push_back(aa.y - v2[i].y);
}

// BEGINNING of calculating the force at the node's current position
float Fx = 0;
float Fy = 0;

for(int i = 0; i < v1.size(); i++)
{
	Fx += (l * r[i] - 1) * dx[i];
	Fy += (l * r[i] - 1) * dy[i];
}

for(int i = v1.size(); i < v1.size() + v2.size(); i++)
{
	Fx += (1.4142 * l * r[i] - 1) * dx[i];
	Fy += (1.4142 * l * r[i] - 1) * dy[i];
}

Fx *= -kk;
Fy *= -kk;
Fy += f;	// adding the external force
// END of calculating the force at the node's current position

float norm = abs(Fx) + abs(Fy);	// norm of the force at this position
if(norm < 0.00001)	// if the node's position is good enough, no need for further calculations
	return norm;

float a = 0;
float b = 0;
float c = 0;
/*
a, b and c form the matrix:
a b
b c
This matrix is the Jacobian of the force at the node's current position, needed for Newton's method.
*/

for(int i = 0; i < v1.size(); i++)	// calculation of the Jacobian
{
	a += 1 - l * r[i] + l * dx[i] * dx[i] * (r[i] * r[i] * r[i]);
	b += l * dx[i] * dy[i] * (r[i] * r[i] * r[i]);
	c += 1 - l * r[i] + l * dy[i] * dy[i] * (r[i] * r[i] * r[i]);
}

for(int i = v1.size(); i < v1.size() + v2.size(); i++)
{
	a += 1 - 1.4142 * l * r[i] + 1.4142 * l * dx[i] * dx[i] * (r[i] * r[i] * r[i]);
	b += 1.4142 * l * dx[i] * dy[i] * (r[i] * r[i] * r[i]);
	c += 1 - 1.4142 * l * r[i] + 1.4142 * l * dy[i] * dy[i] * (r[i] * r[i] * r[i]);
}

a *= kk;
b *= kk;
c *= kk;

float det = a*c - b*b;
float A1 = c/det;
float A2 = -b/det;
float A3 = a/det;
// A1, A2 and A3 form the inverse Jacobian

k.v[i][j].x -= A1*Fx + A2*Fy;
k.v[i][j].y -= A2*Fx + A3*Fy;	// moving the node to a better place using Newton's method
return norm;	// returning the old norm
}

I guess this is everything needed.

The task here is to parallelize the calculations in the "move" function using CUDA, and I am kindly asking for ANY help.

The code itself probably needs to be reorganised for this, but I don't really know how I should do it.

If anyone got through all of this, thanks in advance.

Also, sorry for possibly bad English; it's not my native language.

Also, feel free to ask questions.

Looks like a perfect application for CUDA. Start by reading the CUDA C Programming Guide. It has all the information you need in an accessible way.

Your move() function will become a kernel, with the loops replaced by the x- and y-coordinates of the thread in the grid. step() becomes a __device__ function, and after adding some code to copy data to the GPU and back and turning your call to move() into a kernel invocation, you are done.

I already read it. The problem is: the compiler is complaining about "step" being a host function in a global kernel. And I also have no clue where and when I should put which data.

Remember to either define step() earlier or at least have a forward declaration before its use; otherwise the compiler will assume it's a host function.

Your framework k apparently is the data that you need to allocate space for on the device, and copy from the host to the device and back.

BTW: Code posted in the forum appears much more readable if enclosed within [code]…[/code] tags. It would be nice if you could edit them into your original post.

Could you be more specific about what I should do with the step() function?

Just move it so that it appears before move() in the file. This way, by the time the compiler encounters the call to step(), it will already know it is a __device__ function, not a function on the host.

Another problem came up. I am using a lot of vector operations. The whole framework is a 2-dimensional vector, and while calculating the movement of nodes I use some vectors.

In every vector operation (like .size(), .push_back()) the compiler reports an error: cannot use a host function in a device kernel.

Do I have to rethink the whole algorithm to not use vectors, or is it just some easily removable error? Rewriting the code is generally possible, but I would prefer a more elegant solution.

Most of the standard library is not (yet?) implemented on the device side, so unfortunately for now you will have to write your own code for that.

As CUDA is all about maximum performance (otherwise, just run your code on the CPU…), you might soon find out that you have to think about what your code is doing at a lower level anyway.

Challenge accepted. I'll post when it's done, but this may now take rather longer. Thanks for the help.

So. I have somewhat reworked the heart of the whole simulation, and now I have this function:

__device__ void movenod(framework &k, int i, int j, float f)
{
float r[3][3];
float dx[3][3];
float dy[3][3];
float l = k.l;
float kk = k.k;
int size = k.rozmiar - 1;
int colbegin;
int colend;
int rawbegin;
int rawend;
/*
	Some instructions initialising colbegin, colend, rawbegin and rawend.
*/
nod aa = k.v[i][j];

for(int m = rawbegin; m <= rawend; m++)
	for(int n = colbegin; n <= colend; n++)
	{
		if(!(m == 1 && n == 1))
			if(k.v[i+m-1][j+n-1].exist == 1)
			{
				dx[m][n] = aa.x - k.v[i+m-1][j+n-1].x;
				dy[m][n] = aa.y - k.v[i+m-1][j+n-1].y;
				Q_rsqrt(dx[m][n]*dx[m][n] + dy[m][n]*dy[m][n], r[m][n]);
			}
	}

float Fx = 0;
float Fy = 0;

if(!(i == size && j >= (size+2)/3 && j <= (2 * size + 3) / 3 - 1))
{
	for(int m = rawbegin; m <= rawend; m++)
		for(int n = colbegin; n <= colend; n++)
		{
			if(!(m == 1 && n == 1))
				if(k.v[i+m-1][j+n-1].exist == 1)
				{
					float rr = sqrt((float)(abs(m-1) + abs(n-1))) * l * r[m][n];
					Fx += (rr - 1) * dx[m][n];
					Fy += (rr - 1) * dy[m][n];
				}
		}

	Fx *= -kk;
	Fy *= -kk;

	if(i == 0 && j >= (k.rozmiar + 1)/3 && j <= (2 * k.rozmiar + 1) / 3 - 1)
		Fy += f;
}

float norm = abs(Fx) + abs(Fy);
if(norm > 0.00001)
{
	float a = 0;
	float b = 0;
	float c = 0;

	for(int m = rawbegin; m <= rawend; m++)
		for(int n = colbegin; n <= colend; n++)
			if(!(m == 1 && n == 1))
				if(k.v[i+m-1][j+n-1].exist == 1)
				{
					a += 1 - sqrt((float)(abs(m-1) + abs(n-1))) * l * r[m][n] * (1 + dx[m][n] * dx[m][n] * r[m][n] * r[m][n]);
					b += sqrt((float)(abs(m-1) + abs(n-1))) * l * dx[m][n] * dy[m][n] * (r[m][n] * r[m][n] * r[m][n]);
					c += 1 - sqrt((float)(abs(m-1) + abs(n-1))) * l * r[m][n] * (1 + dy[m][n] * dy[m][n] * r[m][n] * r[m][n]);
				}

	a *= -kk;
	b *= -kk;
	c *= -kk;

	float det = a*c - b*b;
	float A1 = c/det;
	float A2 = -b/det;
	float A3 = a/det;

	k.v[i][j].x -= A1*Fx + A2*Fy;
	k.v[i][j].y -= A2*Fx + A3*Fy;
}
}

And the thing is: the code works perfectly when compiled normally. Under CUDA it gives me errors:

nod aa = k.v[i][j]; -> operand of "*" must be a pointer

and "expression must have pointer type" everywhere I use framework k (for example in if(k.v[i+m-1][j+n-1].exist == 1)).

First question: why is that?

Second question: how should I do this?

I've read some other topics, but I'm not sure if I got it right: is the problem connected with the 2-dimensional vector?

Btw, are you sure you need

float norm = abs(Fx) + abs(Fy);

not

float norm = fabs(Fx) + fabs(Fy);

?

Concerning your question: you need to allocate memory in CUDA yourself and select the proper functions to allocate memory on the GPU.

I'm not sure if Nvidia has come around to implementing vector for device code yet. I certainly haven't used it so far.

Alrighty, so no vectors at all. I guess the thread can be closed; I'm going to implement another, faster method now anyway. Thank you all for the advice :)

By the way, the method I'm going to implement is all about solving a single matrix equation, so I'm still using CUDA for it :).