My program is based on the CUDA particles example,

demonstrated here: http://www.youtube.com/watch?v=3gAJ0qUFPWg

The code can be viewed here: http://docs.nvidia.com/cuda/cuda-samples/#particles

The program's goal is to make spheres bounce off a 3D model composed of a triangle mesh.

I have made a few changes to the kernel. One of them adds this function call and an if branch, which together reduce performance dramatically:

[s]Sum_Normals += collide_Surface(gridPos, pos, vel, d_Planes, d_Cell_Start_Index, d_Cell_Num_Planes, d_Cell_Planes); // ******
// float literals (0.0f) keep these comparisons in single precision
if ( (Sum_Normals.x != 0.0f) | (Sum_Normals.y != 0.0f) | (Sum_Normals.z != 0.0f) )
{
    Sum_Normals = normalize (Sum_Normals);
    Dot_Product = dot (Sum_Normals, vel);
    reflection = vel - 2.0f*Dot_Product*Sum_Normals;
    //force += reflection;
    vel = -reflection*d_Params->boundaryDamping;
}
[/s]
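For reference, the branch above reads as the usual mirror reflection of the velocity about the unit normal, followed by a damping scale. The symbols below are introduced only for illustration: n stands for Sum_Normals, v for vel, r for reflection and β for d_Params->boundaryDamping.

```latex
\hat{n} = \frac{n}{\lVert n \rVert}, \qquad
r = v - 2\,(\hat{n}\cdot v)\,\hat{n}, \qquad
v' = -\beta\, r
```

Since r has the same length as v, the speed after the bounce is |β| times the speed before it.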

This is what collide_Surface() looks like:

```
__device__
float3 collide_Surface( int3 gridPos,
                        float3 Pos,
                        float3 Vel,
                        float* d_Planes,
                        int* d_Cell_Start_Index,
                        int* d_Cell_Num_Planes,
                        int* d_Cell_Planes)
{
    float* triverts;
    float3 normal;
    float3 Sum_Normals = make_float3(0.0f);
    int hash = calc_Grid_Hash(gridPos);

    // visit every triangle registered in this particle's grid cell
    for (int i = 0; i < d_Cell_Num_Planes[hash]; i++)
    {
        // each triangle is stored as 9 consecutive floats (3 vertices * xyz)
        triverts = &d_Planes [ 9*d_Cell_Planes[d_Cell_Start_Index[hash] +i] ];
        if (triSphereOverlap(Pos, triverts))
        {
            normal = calcNormal (triverts);
            Sum_Normals += normal;
        }
    }
    return Sum_Normals;
}
```

And this is what triSphereOverlap(), which is called from inside collide_Surface(), looks like:

```
__device__
bool triSphereOverlap (float3 pos, float* triverts)
{
    bool result;
    float aa,ab,ac,bb,bc,cc,rr, d, d1, d2, d3, e, e1, e2, e3;
    float3 A,B,C,V, AB, BC, CA, Q1, Q2, Q3, QA, QB, QC;

    // check distance from triangle plane to sphere center
    A = make_float3(triverts[0] - pos.x, triverts[1] - pos.y, triverts[2] - pos.z); // Translate problem so sphere is centered at origin
    B = make_float3(triverts[3] - pos.x, triverts[4] - pos.y, triverts[5] - pos.z);
    C = make_float3(triverts[6] - pos.x, triverts[7] - pos.y, triverts[8] - pos.z);
    rr = d_Params->Radius*d_Params->Radius;
    V = cross (B-A,C-A);
    d = dot (A,V);
    e = dot (V,V);
    // check distance from triangle vertices to sphere center
    aa = dot (A,A);
    ab = dot (A,B);
    ac = dot (A,C);
    bb = dot (B,B);
    bc = dot (B,C);
    cc = dot (C,C);
    // check distance from triangle edges to sphere center
    AB = B-A;
    BC = C-B;
    CA = A-C;
    d1 = ab - aa;
    d2 = bc - bb;
    d3 = ac - cc;
    e1 = dot (AB,AB);
    e2 = dot (BC,BC);
    e3 = dot (CA,CA);
    Q1 = A*e1 - d1*AB;
    Q2 = B*e2 - d2*BC;
    Q3 = C*e3 - d3*CA;
    QC = C*e1 - Q1;
    QA = A*e2 - Q2;
    QB = B*e3 - Q3;
    // each of the seven terms is a separation test: it is true when the
    // plane, a vertex or an edge proves the sphere does NOT touch the triangle
    result = ( (d*d > rr*e) |
               ( (aa > rr) & (ab > aa) & (ac > aa) ) |
               ( (bb > rr) & (ab > bb) & (bc > bb) ) |
               ( (cc > rr) & (ac > cc) & (bc > cc) ) |
               ( (dot(Q1,Q1)>rr*e1*e1) & (dot(Q1,QC)>0) ) |
               ( (dot(Q2,Q2)>rr*e2*e2) & (dot(Q2,QA)>0) ) |
               ( (dot(Q3,Q3)>rr*e3*e3) & (dot(Q3,QB)>0) ) );
    // result is true when separated, so negate it to report an overlap
    return !result;
}
```

Code clarification:

The kernel's original purpose is to calculate collisions of all particles with each other.

The space is divided into cells, and every particle has one thread that calculates collisions against all other particles found in the same cell.

The purpose of collide_Surface() is to calculate collisions of all particles with the surface as well.

Each thread iterates over the triangles located in the same cell as that thread's particle.

Of course, the number of iterations differs from thread to thread, depending on the number of triangles in the particle's current cell.
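To make the lookup concrete, here is a minimal host-side sketch of the three-array layout that the loop in collide_Surface() appears to rely on: a per-cell start offset, a per-cell count, and a flat list of triangle indices. The struct and both helper functions are hypothetical names made up for this illustration, and the counting-sort construction is an assumption, not necessarily how the arrays are built in the actual project.

```cpp
// Host-side sketch in plain C++ (also valid as CUDA host code).
#include <utility>
#include <vector>

// Hypothetical container mirroring the three device arrays used by the kernel.
struct CellTriangleTable {
    std::vector<int> cell_start_index; // like d_Cell_Start_Index: first slot of each cell
    std::vector<int> cell_num_planes;  // like d_Cell_Num_Planes: triangles per cell
    std::vector<int> cell_planes;      // like d_Cell_Planes: triangle indices grouped by cell
};

// Build the table from (cell hash, triangle index) pairs with a counting sort.
inline CellTriangleTable build_cell_triangle_table(
    int num_cells, const std::vector<std::pair<int, int>>& pairs)
{
    CellTriangleTable t;
    t.cell_start_index.assign(num_cells, 0);
    t.cell_num_planes.assign(num_cells, 0);
    t.cell_planes.assign(pairs.size(), 0);

    // 1) count the triangles that fall into each cell
    for (const auto& p : pairs) t.cell_num_planes[p.first]++;

    // 2) an exclusive prefix sum of the counts gives each cell's start offset
    int running = 0;
    for (int c = 0; c < num_cells; ++c) {
        t.cell_start_index[c] = running;
        running += t.cell_num_planes[c];
    }

    // 3) scatter the triangle indices into their cell's slice
    std::vector<int> cursor = t.cell_start_index;
    for (const auto& p : pairs) t.cell_planes[cursor[p.first]++] = p.second;
    return t;
}

// Same traversal as the for loop in collide_Surface(), collecting indices only.
inline std::vector<int> triangles_in_cell(const CellTriangleTable& t, int hash)
{
    std::vector<int> out;
    for (int i = 0; i < t.cell_num_planes[hash]; ++i)
        out.push_back(t.cell_planes[t.cell_start_index[hash] + i]);
    return out;
}
```

With this layout, the loop length for a thread is exactly cell_num_planes[hash], which is why threads whose particles sit in differently populated cells run a different number of iterations.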

The purpose of triSphereOverlap() is to test whether the particle and the triangle are colliding.
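As a worked example of how these tests avoid square roots and divisions, take the first comparison, d*d > rr*e. With the sphere center translated to the origin, V = (B-A)×(C-A) is an unnormalized normal of the triangle's plane, so the distance from the center to the plane is |A·V| / |V|. Under that reading of the code (an interpretation, with r standing for d_Params->Radius), the sphere cannot reach the plane when:

```latex
\frac{\lvert A\cdot V\rvert}{\lVert V\rVert} > r
\;\iff\; (A\cdot V)^2 > r^2\,(V\cdot V)
\;\iff\; d^2 > rr\cdot e
```

Both sides are non-negative, so squaring keeps the inequality's direction.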

My question is: how can I improve these two functions and make them more CUDA-friendly?

Sorry for the length of my question. I'm pretty desperate, since I have to submit this project in a few days, and these two functions, which are not that big, are the last significant hitch.