__syncthreads() problem: __syncthreads() results in an infinite loop

As far as I remember, the following strange behaviour of __syncthreads() has already been partially mentioned on the forum. However, I haven't seen any constructive advice regarding it.
So I have a kernel with the following loop inside it:
__global__ void MultiDistance(const int point_size, const int bd_faces_size, FTYPE* glob_dist, const int offset)
{
const int i = blockIdx.x*blockDim.x+threadIdx.x+offset;
float3 pt_cur = make_float3(0,0,0);

if(i < point_size)
	pt_cur = make_float3(tex1Dfetch(texPt, texPtOfs+i));

FTYPE dist = 1.e6;
//define minimal distance between the current point and the external boundary
for(int j=0; j<bd_faces_size; j++)
{
	__shared__ int4 cur_face;
	cur_face = tex1Dfetch(texBdFaces, texBdFacesOfs+j);

	__shared__ float3 pt[4];
	pt[0] = make_float3(tex1Dfetch(texPt, texPtOfs+cur_face.x));
	pt[1] = make_float3(tex1Dfetch(texPt, texPtOfs+cur_face.y));
	pt[2] = make_float3(tex1Dfetch(texPt, texPtOfs+cur_face.z));
	pt[3] = make_float3(tex1Dfetch(texPt, texPtOfs+cur_face.w));
	
	//find a distance from the point to a face only if point's projection belongs to the face
	__shared__ float3 a1;
	a1 = pt[1] - pt[0];
	
	__shared__ float3 a2;
	a2 = pt[2] - pt[0];
	
	__shared__ float3 norm;
	norm = vec(a1,a2); //a normal to the face's plane
	norm /= length(norm);

	const FTYPE proj_dist = fabs(dot(pt_cur-pt[0], norm));

	//project all points at the plane
	__shared__ float3 vx;
	vx = perpendicular(norm.x, norm.y, norm.z);
	
	__shared__ float3 vy;
	vy = vec(norm,vx);
	
	__shared__ float2 proj_pt[4], conv_pt[4];
	for(volatile int k=0; k<4; k++)
	{
		proj_pt[k].x = dot(vx,pt[k]);
		proj_pt[k].y = dot(vy,pt[k]);
	}
	
	//construct a normal polygon on the plane
	convexPolygon<4>(proj_pt, conv_pt);
	
	float2 proj_cur;
	proj_cur.x = dot(vx,pt_cur);
	proj_cur.y = dot(vy,pt_cur);
	const FTYPE plane_dist = tempDist2(4, conv_pt, proj_cur);

	if(j == -1)
		__syncthreads();

	if(plane_dist <= 0)
		dist = min(dist, proj_dist);
	else
		dist = min(dist, sqrt(proj_dist*proj_dist + plane_dist));
}

if(i < point_size)
	glob_dist[i-offset] = dist;

}

There is only one __syncthreads() in the kernel, namely
if(j == -1)
__syncthreads();

As one can easily see, __syncthreads() will never be called. However, if I comment it out, the number of kernel registers grows from 20 to 38!
The second strange fact is that putting __syncthreads() in without the if statement
// if(j == -1)
__syncthreads();
results in a cudaErrorLaunchTimeout error. I should mention here that it is 100% certain that all threads of the block reach this __syncthreads() call (it is also obvious from the kernel's listing). Moreover, if I change the loop count to
for(int j=0; j<2; j++)
the kernel will work with no problems.
However
for(int j=0; j<10; j++)
gives a cudaErrorLaunchTimeout error again! :wacko:
The last fact to mention is that the watchdog still kicks in even though I compute on a Tesla D870, which is the second card in my Dell Precision 690, under Windows Server 2003 x64 with CUDA 2.0.
Any ideas how to handle these problems?
Thanks in advance.

There is a known point here: the compiler does NOT optimize across __syncthreads(). So __syncthreads() does affect the register count!

The other thing that you are saying is quite strange! Not sure if that one comes under some conditional…

if(i < point_size)

pt_cur = make_float3(tex1Dfetch(texPt, texPtOfs+i));

Put the statement inside the "if" within braces (not sure how function inlining would handle it otherwise), just to be safe…

const FTYPE plane_dist = tempDist2(4, conv_pt, proj_cur);

This is the code right before the __syncthreads(). Can you share the code of 'tempDist2'? It probably has something to do with it!

btw,

I see that you are not making the threads work efficiently on your problem! There are a lot of redundant fetches…

One example:

__shared__ int4 cur_face;

cur_face = tex1Dfetch(texBdFaces, texBdFacesOfs+j);

All threads are fetching the same data and storing it in the same shared variable!!! This is a waste of time as I see it (although texture fetches are cached…)! If you know what you are doing here, it is fine! If not, you should probably try to understand more about the programming model!

Good Luck

Sarnath,
first of all thank you for your answer.
Regarding your comments:
I tried to make a conditional statement like this
if(i < point_size)
{
pt_cur = make_float3(tex1Dfetch(texPt, texPtOfs+i));
}
However, as expected, it changed nothing. The program still reports cudaErrorLaunchTimeout with the __syncthreads() left in.
tempDist2 just computes the distance between a 2D point and the given polygon.
Of course I can share it, but it contains nothing interesting regarding my problem:
__device__ FTYPE tempDist2(const int nv, const float2 *vert, const float2 &point)
{
	bool isBone = false;
	FTYPE dist = pow(length(vert[0] - point), 2);
	int index = 0;
	if (nv < 3)
	{
		if (nv == 1)
			return dist;
		const float2 &a = vert[0];
		const float2 &b = vert[1];
		const float2 c = b - a;
		if (!c)
			return dist;
		const float2 d = point - a;
		FTYPE t = dot(d,c);
		if (t <= 0)
			return dist;
		const FTYPE q = dot(c,c);
		if (t >= q)
		{
			dist = pow(length(point - b), 2);
			index = 1;
		}
		else
		{
			t = vec(d,c);
			dist = (t*t) / q;
			if (t < 0)
				index = 1;
			isBone = true;
		}
		return dist;
	}
	for (volatile int i = 0; i < nv; ++i)
	{
		const float2 &a = vert[i ? i-1 : nv-1];
		const float2 &b = vert[i];
		const float2 c = b - a;
		if (!c)
			continue;
		const float2 d = point - a;
		FTYPE t = dot(d,c);
		if (t < 0)
			continue;
		const FTYPE q = dot(c,c);
		if (t >= q)
		{
			t = pow(length(b - point), 2);
			if (dist > t)
			{
				dist = t;
				index = i;
				isBone = false;
			}
		}
		else
		{
			t = vec(d,c);
			t = (t*t) / q;
			if (dist > t)
			{
				dist = t;
				index = i;
				isBone = true;
			}
		}
	}
	const int i = index > 0 ? index - 1 : nv - 1;
	const float2 &a = vert[i];
	const float2 &b = vert[index];
	if (isBone)
	{
		index = i;
		if (vec(b - a, b - point) < 0)
			dist = -dist;
	}
	else
	{
		const float2 &c = vert[index+1 < nv ? index+1 : 0];
		if (vec(b - a, c - b) < 0)
			dist = -dist;
	}
	return dist;
}
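For reference, the (t*t)/q terms above (the forum formatting tends to eat the multiplication sign) are the squared perpendicular distance to the segment's supporting line: t is the 2D cross product vec(d,c) and q the squared segment length. A small host-side sketch of just that term, with hypothetical helper names mirroring dot() and vec():

```cpp
#include <cmath>

struct P2 { float x, y; };

// Hypothetical helpers mirroring dot() and vec() from the post
// (vec() assumed to be the scalar 2D cross product).
inline float dot2(P2 a, P2 b)   { return a.x * b.x + a.y * b.y; }
inline float cross2(P2 a, P2 b) { return a.x * b.y - a.y * b.x; }

// Squared distance from p to the supporting line of segment a-b:
// t = cross2(d, c) is twice the signed area of triangle (a, b, p),
// q = |c|^2, so (t*t)/q is the squared perpendicular distance.
inline float lineDist2(P2 p, P2 a, P2 b)
{
    const P2 c = { b.x - a.x, b.y - a.y };
    const P2 d = { p.x - a.x, p.y - a.y };
    const float t = cross2(d, c);
    const float q = dot2(c, c);
    return (t * t) / q;
}
```

For example, lineDist2({0,2}, {-1,0}, {1,0}) gives 4, the squared distance of the point (0,2) from the x-axis.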

Regarding your last comment about shared memory, I completely agree with you, and of course I know that each thread reads the same data. However, from my point of view it was the only opportunity to reduce register usage and get rid of local memory. The purpose of my kernel is to compute minimal distances between all 3D points which belong to the 3D hex mesh and the mesh boundaries. It means that for each point I compute its distance to all boundary faces and then take the minimal one.
So I organized my kernel in such a manner that one thread is one point, and inside the kernel I have a loop over all boundary faces. It allowed me to have only one global memory write per thread at the end of the kernel (the minimal distance found for the given point is stored). However, as you rightly mentioned, all threads then have to load the same data about the given face from global memory.
I did it in terms of shared variables just to avoid using too many local variables (= reduce the number of registers) and to avoid local arrays like this
float3 pt[4];
So now I hope that I have clarified the situation and explained why I do it this way. However, the problem with __syncthreads() is still open and I really have no idea how to fix it. :(
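To make the structure concrete, here is a plain host-side sketch of the same scheme (one kernel thread = one point = one iteration of the outer loop; faceDist here is just a hypothetical stand-in for the real point-to-face distance computation):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct P3 { float x, y, z; };

// Stand-in for the real point-to-face distance (the actual kernel projects
// onto the face plane, tests the projection against the polygon, etc.).
inline float faceDist(const P3 &p, const P3 &f)
{
    const float dx = p.x - f.x, dy = p.y - f.y, dz = p.z - f.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// The inner loop is the kernel's j-loop with a running minimum,
// written out once per point at the end (the single global write).
std::vector<float> minDistances(const std::vector<P3> &points,
                                const std::vector<P3> &faces)
{
    std::vector<float> dist(points.size(), 1.e6f);
    for (std::size_t i = 0; i < points.size(); ++i)      // one thread
        for (std::size_t j = 0; j < faces.size(); ++j)   // the j-loop
            dist[i] = std::min(dist[i], faceDist(points[i], faces[j]));
    return dist;
}
```

For two points and two degenerate "faces", minDistances({{0,0,0},{2,0,0}}, {{1,0,0},{5,0,0}}) returns {1, 1}: each point keeps the minimum over both faces.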
Here is a link to a similar topic where MisterAnderson42 suggested that a large number of __syncthreads() calls can lead to a possible infinite loop.
http://forums.nvidia.com/index.php?showtopic=71401

OK, this was expected… All right…

As you indicated, the "tempDist2" code does not matter much…

Nope! You could always say:

if (threadIdx.x == 0)
{
	// read data into shared memory
}

__syncthreads();

Just my 2 cents. Much depends on how much memory you are reading…

You could also consider using more threads to load data into shared memory. For example, this is how I usually load data into shared memory… This avoids redundant loads and can also result in memory coalescing, depending on the amount of data you are loading…

/*
 * cacheDataInSharedMemory(sharedMemAddress, GlobalMemoryAddress, SizeInBytes):
 * 1. Assumes that the "size" argument is a multiple of "sizeof(int)"
 * 2. Assumes 32 threads per block. So, does NOT __syncthreads() at the end of the function.
 */
__device__ void cacheDataInSharedMemory(void *shared, void *opt, int size)
{
	int i;
	int *src = (int *) opt;
	int *dst = (int *) shared;

	for (i = threadIdx.x; i < (size/sizeof(int)); i += blockDim.x)
	{
		dst[i] = src[i];
	}
	return;
}
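For instance, in your kernel you could stage the four face points this way (a hypothetical caller sketch: d_face_pts is an invented name, and this assumes the face points are also reachable through a plain device pointer, since the helper reads global memory rather than textures):

```cuda
// Hypothetical caller: stage the four points of face j into shared memory
// cooperatively, then barrier once before every thread uses them.
__shared__ float3 s_pt[4];

cacheDataInSharedMemory(s_pt, d_face_pts + 4 * j, 4 * sizeof(float3));
__syncthreads();   // needed here, since your blocks are larger than one warp
```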

Also, I would suggest moving all the __shared__ declarations to the top of the function rather than declaring them in the middle.

Here is my guess (kinda wild) about the __syncthreads() issue:

I think some of your threads are exiting in the middle, owing to division by zero or something, resulting in a __syncthreads() hang!
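Something like this (a hypothetical minimal kernel, not your code) is the classic form of that hang: any thread that returns before the barrier leaves the rest of the block waiting forever.

```cuda
// Hypothetical sketch: the early-return threads never reach __syncthreads(),
// so the remaining threads of the block spin at the barrier until the
// watchdog kills the launch (cudaErrorLaunchTimeout).
__global__ void hangs(const float *in, float *out, int n)
{
	__shared__ float buf[256];
	const int i = blockIdx.x * blockDim.x + threadIdx.x;

	if (i >= n)
		return;             // some threads of the last block exit here...

	buf[threadIdx.x] = in[i];
	__syncthreads();        // ...and the rest wait for them forever

	out[i] = buf[threadIdx.x];
}
```

The safe form guards the loads and stores, not the barrier, so that every thread of the block stays alive through the __syncthreads().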

If possible, comment out operations that can result in INFINITY or non-deterministic output, like 0/0 or log(0), etc…

If nothing works, I suggest commenting out the code piece by piece and seeing which part actually breaks your kernel!! That's the only debug method I am aware of, and it always works for me!

Good Luck!