Branch divergence: help improve my code

My program is based on the CUDA particles example,
demonstrated here: http://www.youtube.com/watch?v=3gAJ0qUFPWg
The code can be viewed here: http://docs.nvidia.com/cuda/cuda-samples/#particles

The goal of the program is to make spheres bounce off a 3D model composed of a triangle mesh.

I have made a few changes to the kernel. One of them is adding this function call and an if branch, which reduce performance dramatically:

[s]Sum_Normals += collide_Surface(gridPos, pos, vel, d_Planes, d_Cell_Start_Index, d_Cell_Num_Planes, d_Cell_Planes); // ******
if ( (Sum_Normals.x != 0.0f) | (Sum_Normals.y != 0.0f) | (Sum_Normals.z != 0.0f) )
{
	Sum_Normals = normalize (Sum_Normals);
	Dot_Product = dot (Sum_Normals, vel);
	reflection = vel - 2.0f*Dot_Product*Sum_Normals;
	//force += reflection;
	vel = -reflection*d_Params->boundaryDamping;
}

[/s]

This is what collide_Surface looks like:

[s]__device__
float3 collide_Surface( int3 gridPos,
			float3 Pos,
			float3 Vel,
			float* d_Planes,
			int* d_Cell_Start_Index,
			int* d_Cell_Num_Planes,
			int* d_Cell_Planes)
{
	float* triverts;
	float3 normal;
	float3 Sum_Normals = make_float3(0.0f);
	int hash = calc_Grid_Hash(gridPos);

	// loop over every triangle registered in this particle's grid cell
	for (int i = 0; i < d_Cell_Num_Planes[hash]; i++)
	{
		// each triangle is stored as 9 consecutive floats (3 vertices)
		triverts = &d_Planes [ 9*d_Cell_Planes[d_Cell_Start_Index[hash] +i] ];
		if (triSphereOverlap(Pos, triverts))
		{
			normal = calcNormal (triverts);
			Sum_Normals += normal;
		}
	}
	return Sum_Normals;

}

[/s]

And this is what triSphereOverlap(), which is called from inside collide_Surface(), looks like:

[s]__device__
bool triSphereOverlap (float3 pos, float* triverts)
{
	bool result;
	float aa,ab,ac,bb,bc,cc,rr, d, d1, d2, d3, e, e1, e2, e3;
	float3 A,B,C,V, AB, BC, CA, Q1, Q2, Q3, QA, QB, QC;

	// check distance from triangle plane to sphere center
	A = make_float3(triverts[0] - pos.x, triverts[1] - pos.y, triverts[2] - pos.z);	// translate problem so sphere is centered at origin
	B = make_float3(triverts[3] - pos.x, triverts[4] - pos.y, triverts[5] - pos.z);
	C = make_float3(triverts[6] - pos.x, triverts[7] - pos.y, triverts[8] - pos.z);
	rr = d_Params->Radius*d_Params->Radius;
	V = cross (B-A,C-A);
	d = dot (A,V);
	e = dot (V,V);

	// check distance from triangle vertices to sphere center
	aa = dot (A,A);
	ab = dot (A,B);
	ac = dot (A,C);
	bb = dot (B,B);
	bc = dot (B,C);
	cc = dot (C,C);

	// check distance from triangle edges to sphere center
	AB = B-A;
	BC = C-B;
	CA = A-C;
	d1 = ab - aa;
	d2 = bc - bb;
	d3 = ac - cc;
	e1 = dot (AB,AB);
	e2 = dot (BC,BC);
	e3 = dot (CA,CA);
	Q1 = A*e1 - d1*AB;
	Q2 = B*e2 - d2*BC;
	Q3 = C*e3 - d3*CA;
	QC = C*e1 - Q1;
	QA = A*e2 - Q2;
	QB = B*e3 - Q3;

	// each term below is true when the plane, a vertex or an edge separates the
	// sphere from the triangle, so the two overlap only when none of them holds
	result = !(  (d*d > rr*e)	|
		   (  (aa > rr) & (ab > aa) & (ac > aa)  )	|
		   (  (bb > rr) & (ab > bb) & (bc > bb)  )	|
		   (  (cc > rr) & (ac > cc) & (bc > cc)  )	|
		   (  (dot(Q1,Q1)>rr*e1*e1)  &  (dot(Q1,QC)>0) )	|
		   (  (dot(Q2,Q2)>rr*e2*e2)  &  (dot(Q2,QA)>0) )	|
		   (  (dot(Q3,Q3)>rr*e3*e3)  &  (dot(Q3,QB)>0) )	);

	return result;

}[/s]
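For checking the test on the CPU, the same arithmetic can be written as CUDA C++ that also builds as plain C++. This is only a sketch under assumptions: `Vec3`, `HD` and `sphereTriangleOverlap` are made-up names, and the radius is passed in as a parameter instead of being read from d_Params. The seven comparisons are computed as separation tests, and the function returns true only when none of them holds.

```cpp
// Under nvcc, HD marks a function as callable from host and device;
// a plain C++ compiler sees an empty macro.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical stand-in for CUDA's float3, so the sketch builds without CUDA headers.
struct Vec3 { float x, y, z; };

HD inline Vec3 sub(Vec3 a, Vec3 b)    { return Vec3{a.x - b.x, a.y - b.y, a.z - b.z}; }
HD inline Vec3 scale(Vec3 a, float s) { return Vec3{a.x * s, a.y * s, a.z * s}; }
HD inline float dot(Vec3 a, Vec3 b)   { return a.x * b.x + a.y * b.y + a.z * b.z; }
HD inline Vec3 cross(Vec3 a, Vec3 b) {
    return Vec3{a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// True when a sphere (center pos, radius r) touches the triangle stored as
// nine consecutive floats (A, B, C) in tri.
HD inline bool sphereTriangleOverlap(Vec3 pos, float r, const float* tri)
{
    // Translate so the sphere is centered at the origin.
    const Vec3 A = {tri[0] - pos.x, tri[1] - pos.y, tri[2] - pos.z};
    const Vec3 B = {tri[3] - pos.x, tri[4] - pos.y, tri[5] - pos.z};
    const Vec3 C = {tri[6] - pos.x, tri[7] - pos.y, tri[8] - pos.z};
    const float rr = r * r;

    // Plane test: is the center farther than r from the triangle's plane?
    const Vec3 V = cross(sub(B, A), sub(C, A));
    const float d = dot(A, V), e = dot(V, V);
    const bool sepPlane = d * d > rr * e;

    // Vertex tests.
    const float aa = dot(A, A), ab = dot(A, B), ac = dot(A, C);
    const float bb = dot(B, B), bc = dot(B, C), cc = dot(C, C);
    const bool sepA = (aa > rr) & (ab > aa) & (ac > aa);
    const bool sepB = (bb > rr) & (ab > bb) & (bc > bb);
    const bool sepC = (cc > rr) & (ac > cc) & (bc > cc);

    // Edge tests.
    const Vec3 AB = sub(B, A), BC = sub(C, B), CA = sub(A, C);
    const float e1 = dot(AB, AB), e2 = dot(BC, BC), e3 = dot(CA, CA);
    const Vec3 Q1 = sub(scale(A, e1), scale(AB, ab - aa));
    const Vec3 Q2 = sub(scale(B, e2), scale(BC, bc - bb));
    const Vec3 Q3 = sub(scale(C, e3), scale(CA, ac - cc));
    const bool sepAB = (dot(Q1, Q1) > rr * e1 * e1) & (dot(Q1, sub(scale(C, e1), Q1)) > 0.0f);
    const bool sepBC = (dot(Q2, Q2) > rr * e2 * e2) & (dot(Q2, sub(scale(A, e2), Q2)) > 0.0f);
    const bool sepCA = (dot(Q3, Q3) > rr * e3 * e3) & (dot(Q3, sub(scale(B, e3), Q3)) > 0.0f);

    // Non-short-circuit | and & keep this free of data-dependent branches.
    const bool separated = sepPlane | sepA | sepB | sepC | sepAB | sepBC | sepCA;
    return !separated;
}
```

A unit sphere at the origin should not overlap a triangle lying in the plane z = 10, and should overlap one lying in the plane z = 0.5 that covers the origin's projection.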

Code clarification:

The kernel's original purpose is to calculate collisions of all particles with each other.
The space is divided into cells, and every particle has one thread that calculates collisions against all other particles found in the same cell.
The purpose of collide_Surface() is to calculate collisions of all particles with the surface as well.

Each thread iterates over the triangles located in the same cell as the thread's particle.

Of course, the number of iterations differs from thread to thread, according to the number of triangles found in the particle's current cell.
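To get a rough feel for what the uneven iteration counts cost, one can estimate how much of the loop work is useful. The host-side sketch below assumes a simplified model in which every warp runs the loop as many times as its busiest thread, so idle lanes still occupy the warp. `warpLoopEfficiency` and its input vector are hypothetical names, not part of the particles sample.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Fraction of issued loop iterations that do useful work, assuming each warp
// executes the triangle loop as many times as its busiest lane.
// trianglesPerThread[i] is the loop trip count of thread i, in launch order.
double warpLoopEfficiency(const std::vector<int>& trianglesPerThread, int warpSize = 32)
{
    long long useful = 0;  // iterations that test a real triangle
    long long issued = 0;  // lane slots occupied, including idle lanes
    for (std::size_t base = 0; base < trianglesPerThread.size(); base += warpSize) {
        const std::size_t end = std::min(trianglesPerThread.size(), base + warpSize);
        int longest = 0;
        for (std::size_t i = base; i < end; ++i) {
            useful += trianglesPerThread[i];
            longest = std::max(longest, trianglesPerThread[i]);
        }
        issued += static_cast<long long>(longest) * warpSize;
    }
    return issued == 0 ? 1.0 : static_cast<double>(useful) / static_cast<double>(issued);
}
```

A value near 1 means the threads of a warp see similar triangle counts. A low value suggests that sorting particles by cell, or using smaller cells, could matter more than tuning the overlap test itself.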

The purpose of triSphereOverlap() is to test whether the particle and the triangle are colliding.

My question is: how can I improve these two functions and make them more CUDA-friendly?
Sorry for the length of my question. I'm pretty desperate, since I have to submit this project in a few days, and these two functions, which are not that big, are the last significant hitch.

It depends on how big your cell is. If it is a large cell, it might be worth using one block per cell. For a very small cell, one thread per cell might be enough. In both cases you can use shared memory to avoid multiple memory transfers.
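A minimal sketch of the one-block-per-cell idea might look like the kernel below. It is written under several assumptions: `collideCellTriangles`, `MAX_TRIS_PER_CELL`, `cellParticleStart`, `cellParticleCount`, `sortedPos` and `hitCount` are hypothetical names, the particles are assumed to be sorted by cell, and triSphereOverlap() is the function posted above, defined in the same file.

```cuda
#include <cuda_runtime.h>

// Assumed upper bound on triangles per cell; triangles beyond it are ignored here.
#define MAX_TRIS_PER_CELL 64

// The overlap test from the question, defined elsewhere in the same file.
__device__ bool triSphereOverlap(float3 pos, float* triverts);

// One block per grid cell: the block copies the cell's triangles into shared
// memory once, then its threads test the cell's particles against that copy.
__global__ void collideCellTriangles(const float*  d_Planes,
                                     const int*    d_Cell_Start_Index,
                                     const int*    d_Cell_Num_Planes,
                                     const int*    d_Cell_Planes,
                                     const int*    cellParticleStart,  // hypothetical
                                     const int*    cellParticleCount,  // hypothetical
                                     const float4* sortedPos,          // hypothetical
                                     int*          hitCount)           // hypothetical output
{
    __shared__ float tris[MAX_TRIS_PER_CELL * 9];

    const int cell     = blockIdx.x;
    const int numTris  = min(d_Cell_Num_Planes[cell], MAX_TRIS_PER_CELL);
    const int firstTri = d_Cell_Start_Index[cell];

    // Cooperative load: the threads of the block share the copy, one float each per step.
    for (int k = threadIdx.x; k < numTris * 9; k += blockDim.x) {
        const int tri = d_Cell_Planes[firstTri + k / 9];
        tris[k] = d_Planes[9 * tri + k % 9];
    }
    __syncthreads();

    // All threads of the block loop over the same number of triangles.
    const int firstParticle = cellParticleStart[cell];
    for (int p = threadIdx.x; p < cellParticleCount[cell]; p += blockDim.x) {
        const float4 pos4 = sortedPos[firstParticle + p];
        const float3 pos  = make_float3(pos4.x, pos4.y, pos4.z);
        int hits = 0;
        for (int t = 0; t < numTris; ++t)
            hits += triSphereOverlap(pos, &tris[9 * t]) ? 1 : 0;
        hitCount[firstParticle + p] = hits;
    }
}
```

It would be launched with one block per cell, for example `collideCellTriangles<<<numCells, 64>>>(...)`. Because every thread in a block reads the same triangles from shared memory and runs the same number of iterations, both the repeated global loads and the uneven loop lengths inside a warp go away, at the price of idle threads in sparsely populated cells.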

Well, the cell is a cube of size 3*cell diameter.

Now that I think of it, I can reduce the cell size by 33% and thus have fewer triangles in each cell to check against.

I'll do that, thanks.

But anyway,
I tested the program by disabling parts of the code above,
and my conclusion is that triSphereOverlap() is the most significant factor, far more than the other parts.

As you can see, it's not a very complicated function, but it has a lot of branches in it,
and as I read, because the GPU works in SIMD fashion, this can cause long delays.

If you can show me how to adapt the function's structure to CUDA,
you'll make me a very happy person :)
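One possible adjustment, offered as a sketch and not a measured improvement: in triSphereOverlap() the values V, e, e1, e2 and e3 depend only on differences of the triangle's vertices, so they do not change when the sphere position is subtracted. They could be computed once per triangle on the host instead of once per particle-triangle pair. `TriConstants` and `precomputeTriConstants` are hypothetical names, and the input uses the same nine-floats-per-triangle layout as d_Planes.

```cpp
#include <cstddef>
#include <vector>

// Per-triangle values that do not depend on the sphere position.
struct TriConstants {
    float V[3];        // unnormalized plane normal, cross(B - A, C - A)
    float e;           // dot(V, V)
    float e1, e2, e3;  // squared edge lengths |AB|^2, |BC|^2, |CA|^2
};

// tris holds nine floats per triangle: A.xyz, B.xyz, C.xyz.
std::vector<TriConstants> precomputeTriConstants(const std::vector<float>& tris)
{
    std::vector<TriConstants> out(tris.size() / 9);
    for (std::size_t t = 0; t < out.size(); ++t) {
        const float* p = &tris[9 * t];
        float ab[3], bc[3], ac[3];
        for (int k = 0; k < 3; ++k) {
            ab[k] = p[3 + k] - p[k];      // B - A
            bc[k] = p[6 + k] - p[3 + k];  // C - B
            ac[k] = p[6 + k] - p[k];      // C - A (same length as CA)
        }
        TriConstants& c = out[t];
        c.V[0] = ab[1] * ac[2] - ab[2] * ac[1];
        c.V[1] = ab[2] * ac[0] - ab[0] * ac[2];
        c.V[2] = ab[0] * ac[1] - ab[1] * ac[0];
        c.e  = c.V[0] * c.V[0] + c.V[1] * c.V[1] + c.V[2] * c.V[2];
        c.e1 = ab[0] * ab[0] + ab[1] * ab[1] + ab[2] * ab[2];
        c.e2 = bc[0] * bc[0] + bc[1] * bc[1] + bc[2] * bc[2];
        c.e3 = ac[0] * ac[0] + ac[1] * ac[1] + ac[2] * ac[2];
    }
    return out;
}
```

The resulting array would be copied to the device next to d_Planes. Whether this helps depends on the balance between the saved arithmetic (one cross product and four dot products per test) and the seven extra floats loaded per triangle, so it is worth profiling both versions.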