How To Reduce Branches?

Hi everyone.

I’m implementing a function very similar to the nearest neighbour.

The difference to the common nearest neighbour or 3D range search, is that I’m computing the minimum distance of a point to a set of triangles (usually is a set of points).

Due to this, I cannot use the existing GPU implementations of nearest neighbour.

My problem, is that the code I’m using (http://www.geometrictools.com/Documentation/DistancePoint3Triangle3.pdf) has lots of branches, which results in a lot of path divergence. I believe that in a warp, only 10 threads follow the same path.

The excerpt below shows a C implementation of these branches. Does anyone knows a way to avoid them, or at least, to reduce the divergence? Basically I have a set of points, which I map in a one dimension grid divided in several blocks.

dim3 dimBlock(256);

dim3 dimGrid((totalPoints/dimBlock.x) + (totalPoints % dimBlock.x == 0?0:1));

The kernel uses this variables:

	vector kDiff, kEdge0, kEdge1; //Each vector is a structure with 3 floats

	vector m_kClosestPoint1;

	float fA00, fA01, fA11, fB0, fB1, fC, fDet, fS, fT, fSqrDistance;

My graphic is an old Geforce 8500GT.

The code excerpt is the following:

[codebox] if (fS + fT <= fDet)

	{

		if (fS < 0.0f)

		{

			if (fT < 0.0f) // region 4

			{

				if (fB0 < 0.0f)

				{

					fT = 0.0f;

					if (-fB0 >= fA00)

					{

						fS = 1.0f;

						fSqrDistance = fA00 + 2.0f * fB0 + fC;

					}

					else

					{

						fS = -fB0/fA00;

						fSqrDistance = fB0 * fS + fC;

					}

				}

				else

				{

					fS = 0.0f;

					if (fB1 >= 0.0f)

					{

						fT = 0.0f;

						fSqrDistance = fC;

					}

					else if (-fB1 >= fA11)

					{

						fT = 1.0f;

						fSqrDistance = fA11 + 2.0f * fB1 + fC;

					}

					else

					{

						fT = -fB1/fA11;

						fSqrDistance = fB1 *fT + fC;

					}

				}

			}

			else // region 3

			{

				fS = 0.0f;

				if (fB1 >= 0.0f)

				{

					fT = 0.0f;

					fSqrDistance = fC;

				}

				else if (-fB1 >= fA11)

				{

					fT = 1.0f;

					fSqrDistance = fA11 + 2.0f * fB1 + fC;

				}

				else

				{

					fT = -fB1/fA11;

					fSqrDistance = fB1 * fT + fC;

				}

			}

		}

		else if (fT < 0.0f) // Region 5

		{

			fT = 0.0f;

			if (fB0 >= 0.0f)

			{

				fS = 0.0f;

				fSqrDistance = fC;

			}

			else if (-fB0 >= fA00)

			{

				fS = 1.0f;

				fSqrDistance = fA00 + 2.0f * fB0 + fC;

			}

			else

			{

				fS = -fB0/fA00;

				fSqrDistance = fB0 * fS + fC;

			}

		}

		else // region 0

		{

			// minimum at interior point

			float fInvDet = 1.0f/fDet;

			fS *= fInvDet;

			fT *= fInvDet;

			fSqrDistance = fS*(fA00*fS+fA01*fT+2.0f*fB0) + fT*(fA01*fS+fA11*fT+2.0f*fB1)+fC;

		}

	}

	else

	{

		float fTmp0, fTmp1, fNumer, fDenom;

		if (fS < 0.0f) // region 2

		{

			fTmp0 = fA01 + fB0;

			fTmp1 = fA11 + fB1;

			if (fTmp1 > fTmp0)

			{

				fNumer = fTmp1 - fTmp0;

				fDenom = fA00 - 2.0f * fA01 + fA11;

				if (fNumer >= fDenom)

				{

					fS = 1.0f;

					fT = 0.0f;

					fSqrDistance = fA00 + 2.0f * fB0 + fC;

				}

				else

				{

					fS = fNumer/fDenom;

					fT = 1.0f - fS;

					fSqrDistance = fS * (fA00*fS+fA01*fT+2.0f*fB0) + fT*(fA01*fS+fA11*fT+2.0f*fB1)+fC;

				}

			}

			else

			{

				fS = 0.0f;

				if (fTmp1 <= 0.0f)

				{

					fT = 1.0f;

					fSqrDistance = fA11 + 2.0f * fB1 + fC;

				}

				else if (fB1 >= 0.0f)

				{

					fT = 0.0f;

					fSqrDistance = fC;

				}

				else

				{

					fT = -fB1/fA11;

					fSqrDistance = fB1*fT+fC;

				}

			}

		}

		else if (fT < 0.0f) // region 6

		{

			fTmp0 = fA01 + fB1;

			fTmp1 = fA00 + fB0;

			if (fTmp1 > fTmp0)

			{

				fNumer = fTmp1 - fTmp0;

				fDenom = fA00 - 2.0f * fA01 + fA11;

				if (fNumer >= fDenom)

				{

					fT = 1.0f;

					fS = 0.0f;

					fSqrDistance = fA11 + 2.0f * fB1 + fC;

				}

				else

				{

					fT = fNumer/fDenom;

					fS = 1.0f - fT;

					fSqrDistance = fS*(fA00*fS+fA01*fT+2.0f*fB0) + fT*(fA01*fS+fA11*fT+2.0f*fB1)+fC;

				}

			}

			else

			{

				fT = 0.0f;

				if (fTmp1 <= 0.0f)

				{

					fS = 1.0f;

					fSqrDistance = fA00 + 2.0f * fB0 + fC;

				}

				else if (fB0 >= 0.0f)

				{

					fS = 0.0f;

					fSqrDistance = fC;

				}

				else

				{

					fS = -fB0/fA00;

					fSqrDistance = fB0 * fS + fC;

				}

			}

		}

		else // region 1

		{

			fNumer = fA11 + fB1 - fA01 - fB0;

			if (fNumer <= 0.0f)

			{

				fS = 0.0f;

				fT = 1.0f;

				fSqrDistance = fA11 + 2.0f * fB1 + fC;

			}

			else

			{

				fDenom = fA00 - 2.0f * fA01 + fA11;

				if (fNumer >= fDenom)

				{

					fS = 1.0f;

					fT = 0.0f;

					fSqrDistance = fA00 + 2.0f * fB0 + fC;

				}

				else

				{

					fS = fNumer/fDenom;

					fT = 1.0f - fS;

					fSqrDistance = fS*(fA00*fS+fA01*fT+2.0f*fB0) + fT*(fA01*fS+fA11*fT+2.0f*fB1)+fC;

				}

			}

		}

	}

	// account for numerical round-off error

	if (fSqrDistance < 0.0f)

	{

		fSqrDistance = 0.0f;

	}

[/codebox]

Thanks for your help.

Hello,

The thing I would try is (with an example) :

if(a > 0.f)

   x = 5.f;

else

  x = 8.f;

You can translate this into :

float a0 = (a > 0.f) * 1.f;

x = 5.f * a0;

x += 8.f * (1.f - a0);

Your code is very much longer, but it seems possible to translate it this way.

By the way, your kernel will need more registers, Also it could be an idea not to store bool-like variables like a0, but I think it’ll produce much more instructions, higher register pressure, and perhaps an overhead due to the repeated conversion bool->float (?)

Hello,

The thing I would try is (with an example) :

if(a > 0.f)

   x = 5.f;

else

  x = 8.f;

You can translate this into :

float a0 = (a > 0.f) * 1.f;

x = 5.f * a0;

x += 8.f * (1.f - a0);

Your code is very much longer, but it seems possible to translate it this way.

By the way, your kernel will need more registers, Also it could be an idea not to store bool-like variables like a0, but I think it’ll produce much more instructions, higher register pressure, and perhaps an overhead due to the repeated conversion bool->float (?)

Or just
[font=“Courier New”]x = a>0.f ? 5.f : 8.f;[/font]

Or just
[font=“Courier New”]x = a>0.f ? 5.f : 8.f;[/font]

Yes, forgot about this famous “instruction prefetching”

(I’m looking for a good paper that talks about CUDA instruction prefetch, anyone knows one ? google gives me quite nothing)

Yes, forgot about this famous “instruction prefetching”

(I’m looking for a good paper that talks about CUDA instruction prefetch, anyone knows one ? google gives me quite nothing)

Thanks for your help.

I just have one more question.

If I have something like:

[codebox]if (x > =0)

x = 5;

else if (y < 0)

x = 7;

else

x = 10; [/codebox]

It is valid to use the ternary operator this way:

[codebox]x = (x >=0) ? (5) : ((y< 0) ? (7) : (10));[/codebox]

The code compiles and the results are correct, however I do not know if the final assembly\ptx code is branchless.

I’m currently modifying the code to avoid all the ifs and elses, but only when I’m done I will be able to know if the code is more efficient.

Meanwhile if you can help me with my question above, I appreciate.

Thanks for your help.

Thanks for your help.

I just have one more question.

If I have something like:

[codebox]if (x > =0)

x = 5;

else if (y < 0)

x = 7;

else

x = 10; [/codebox]

It is valid to use the ternary operator this way:

[codebox]x = (x >=0) ? (5) : ((y< 0) ? (7) : (10));[/codebox]

The code compiles and the results are correct, however I do not know if the final assembly\ptx code is branchless.

I’m currently modifying the code to avoid all the ifs and elses, but only when I’m done I will be able to know if the code is more efficient.

Meanwhile if you can help me with my question above, I appreciate.

Thanks for your help.

Yes, it’s valid, and most likely gets compiled to the same code.

By the way, it seems possible to trick the compiler into emitting branchless code by using a temporary variable:

int t = (y< 0) ? 7 : 10;

x = (x >=0) ? 5 : t;

Yes, it’s valid, and most likely gets compiled to the same code.

By the way, it seems possible to trick the compiler into emitting branchless code by using a temporary variable:

int t = (y< 0) ? 7 : 10;

x = (x >=0) ? 5 : t;

But is that extra variable really necessary? Are we not using an extra register that way?

But is that extra variable really necessary? Are we not using an extra register that way?

When you see so many ifs, I suggest trying another algorithm without the ifs first.

If I understand the problem correctly, it is finding the distance between a point and a triangle in 3D.

Your algorithm seems fine for serial calculation, but the cost of calculations on the GPU is less and that of branches more.

So try a simple vector calculation to get the advantages of the GPU, as follows:

  1. calculate distance of point to plane of triangle by projecting the vector from P to one of the vertices on the normal of the plane.

  2. calculate the projected point P’ in the plane of the triangle

  3. compute the distance of P’ to the triangle.

  4. combine these two distances using pythagoras to the final result

Serial code is attached. You will see there are some ifs, but they occur at the end of the routine and do not add up to a lot.

I leave porting to cuda to you, if you wish, or maybe I’ll get around to it.

The function to be “kernalized” is calculate_distance_cpu().

I think my code is correct, but since this is just an example, I didn’t check thoroughly.

If anyone checks correctness of results, please let me know.

A project file is included for VC 9.

Have fun.
triangle_distance.zip (3.32 KB)

When you see so many ifs, I suggest trying another algorithm without the ifs first.

If I understand the problem correctly, it is finding the distance between a point and a triangle in 3D.

Your algorithm seems fine for serial calculation, but the cost of calculations on the GPU is less and that of branches more.

So try a simple vector calculation to get the advantages of the GPU, as follows:

  1. calculate distance of point to plane of triangle by projecting the vector from P to one of the vertices on the normal of the plane.

  2. calculate the projected point P’ in the plane of the triangle

  3. compute the distance of P’ to the triangle.

  4. combine these two distances using pythagoras to the final result

Serial code is attached. You will see there are some ifs, but they occur at the end of the routine and do not add up to a lot.

I leave porting to cuda to you, if you wish, or maybe I’ll get around to it.

The function to be “kernalized” is calculate_distance_cpu().

I think my code is correct, but since this is just an example, I didn’t check thoroughly.

If anyone checks correctness of results, please let me know.

A project file is included for VC 9.

Have fun.

Hi.

I will check your code to see if it is correct. If everything is fine I will port it to CUDA and post the results here.

Thanks.

Hi.

I will check your code to see if it is correct. If everything is fine I will port it to CUDA and post the results here.

Thanks.

I have improved the code (calculating distance between point and triangle in 3D) and checked its correctness. This was rather more work External Media than I anticipated, partly because differences with the code you quote from were only resolved utlimately by changing a little bit of the wildmagic5 code.

The result is interesting, but not finished and certainly not easy to use right now, unless you can spend a bit of time on it.

But I think it is correct and the algorithm is rather straightforward with few ifs.

There is a cpu, a cuda and an adapted wildmagic implementation.

Hope it is useful, see attachment.

EDIT:

Have appended version 0.3 with the issues of correctness clarified.
triangle_distance_v0.3.zip (432 KB)
triangle_distance_v0.2.zip (353 KB)

I have improved the code (calculating distance between point and triangle in 3D) and checked its correctness. This was rather more work External Media than I anticipated, partly because differences with the code you quote from were only resolved utlimately by changing a little bit of the wildmagic5 code.

The result is interesting, but not finished and certainly not easy to use right now, unless you can spend a bit of time on it.

But I think it is correct and the algorithm is rather straightforward with few ifs.

There is a cpu, a cuda and an adapted wildmagic implementation.

Hope it is useful, see attachment.

EDIT:

Have appended version 0.3 with the issues of correctness clarified.