__syncthreads() problem: __syncthreads() results in an infinite loop

As far as I remember, the following strange behaviour of __syncthreads() has already been partially mentioned on the forum. However, I haven't seen any constructive advice regarding it.
So I have a kernel with the following loop inside it:
__global__ void MultiDistance(const int point_size, const int bd_faces_size, FTYPE* glob_dist, const int offset)
{
const int i = blockIdx.x*blockDim.x+threadIdx.x+offset;
float3 pt_cur = make_float3(0,0,0);

if(i < point_size)
	pt_cur = make_float3(tex1Dfetch(texPt, texPtOfs+i));

FTYPE dist = 1.e6;
//define minimal distance between the current point and the external boundary
for(int j=0; j<bd_faces_size; j++)
{
	__shared__ int4 cur_face;
	cur_face = tex1Dfetch(texBdFaces, texBdFacesOfs+j);

	__shared__ float3 pt[4];
	pt[0] = make_float3(tex1Dfetch(texPt, texPtOfs+cur_face.x));
	pt[1] = make_float3(tex1Dfetch(texPt, texPtOfs+cur_face.y));
	pt[2] = make_float3(tex1Dfetch(texPt, texPtOfs+cur_face.z));
	pt[3] = make_float3(tex1Dfetch(texPt, texPtOfs+cur_face.w));
	
	//find a distance from the point to a face only if point's projection belongs to the face
	__shared__ float3 a1;
	a1 = pt[1] - pt[0];
	
	__shared__ float3 a2;
	a2 = pt[2] - pt[0];
	
	__shared__ float3 norm;
	norm = vec(a1,a2); //a normal to the face's plane
	norm /= length(norm);

	const FTYPE proj_dist = fabs(dot(pt_cur-pt[0], norm));

	//project all points at the plane
	__shared__ float3 vx;
	vx = perpendicular(norm.x, norm.y, norm.z);
	
	__shared__ float3 vy;
	vy = vec(norm,vx);
	
	__shared__ float2 proj_pt[4], conv_pt[4];
	for(volatile int k=0; k<4; k++)
	{
		proj_pt[k].x = dot(vx,pt[k]);
		proj_pt[k].y = dot(vy,pt[k]);
	}
	
	//construct a normal polygon on the plane
	convexPolygon<4>(proj_pt, conv_pt);
	
	float2 proj_cur;
	proj_cur.x = dot(vx,pt_cur);
	proj_cur.y = dot(vy,pt_cur);
	const FTYPE plane_dist = tempDist2(4, conv_pt, proj_cur);

	if(j == -1)
		__syncthreads();

	if(plane_dist <= 0)
		dist = min(dist, proj_dist);
	else
		dist = min(dist, sqrt(proj_dist*proj_dist + plane_dist));
}

if(i < point_size)
	glob_dist[i-offset] = dist;

}

There is only one __syncthreads() in the kernel, namely
if(j == -1)
__syncthreads();

As one can easily see, __syncthreads() will never be called. However, if I comment it out, the number of kernel registers grows from 20 to 38!
The second strange fact is that putting __syncthreads() in without the if statement
// if(j == -1)
__syncthreads();
results in a cudaErrorLaunchTimeout error. I should mention here that it is 100% certain that all threads of the block reach this __syncthreads() call (it is also obvious from the kernel's listing). Moreover, if I change the loop count to
for(int j=0; j<2; j++)
the kernel will work with no problems.
However
for(int j=0; j<10; j++)
gives a cudaErrorLaunchTimeout error again! :wacko:
The last fact to mention is that the watchdog still kicks in even though I compute on a Tesla D870, which is the second card in my Dell Precision 690, under Windows Server 2003 x64 with CUDA 2.0.
Any ideas how to handle these problems?
Thanks in advance.

There is a known point here: the compiler does NOT optimize across __syncthreads(). So __syncthreads() does affect the register count!

The other thing that you are saying is quite strange! Not sure if that one comes under some conditional…

if(i < point_size)

pt_cur = make_float3(tex1Dfetch(texPt, texPtOfs+i));

Put the statement inside the "if" within braces (not sure how function inlining would handle it otherwise), just to be safe…

const FTYPE plane_dist = tempDist2(4, conv_pt, proj_cur);

This is the code right before the __syncthreads(). Can you share the code of 'tempDist2'? It probably has something to do with it!

btw,

I see that you are not making the threads work efficiently on your problem! There are a lot of redundant fetches…

One example:

__shared__ int4 cur_face;

cur_face = tex1Dfetch(texBdFaces, texBdFacesOfs+j);

All threads are fetching the same data and storing it in the same shared variable!!! This is a waste of time as I see it (although texture fetches are cached…)! If you know what you are doing here, it is fine! If not, you should probably try to understand more about the programming model!

Good Luck

Sarnath,
first of all thank you for your answer.
Regarding your comments:
I tried to make a conditional statement like this
if(i < point_size)
{
pt_cur = make_float3(tex1Dfetch(texPt, texPtOfs+i));
}
However, as expected, it changed nothing. The program still reports cudaErrorLaunchTimeout with the __syncthreads() left in.
tempDist2 just computes the distance between a 2D point and the given polygon.
Of course I can share it, but it contains nothing interesting regarding my problem:
__device__ FTYPE tempDist2(const int nv, const float2 *vert, const float2 &point)
{
	bool isBone = false;
	FTYPE dist = pow(length(vert[0] - point), 2);
	int index = 0;
	if (nv < 3)
	{
		if (nv == 1)
			return dist;
		const float2 &a = vert[0];
		const float2 &b = vert[1];
		const float2 c = b - a;
		if (!c)
			return dist;
		const float2 d = point - a;
		FTYPE t = dot(d,c);
		if (t <= 0)
			return dist;
		const FTYPE q = dot(c,c);
		if (t >= q)
		{
			dist = pow(length(point - b), 2);
			index = 1;
		}
		else
		{
			t = vec(d,c);
			dist = (t*t) / q;
			if (t < 0)
				index = 1;
			isBone = true;
		}
		return dist;
	}
	for (volatile int i = 0; i < nv; ++i)
	{
		const float2 &a = vert[i ? i-1 : nv-1];
		const float2 &b = vert[i];
		const float2 c = b - a;
		if (!c)
			continue;
		const float2 d = point - a;
		FTYPE t = dot(d,c);
		if (t < 0)
			continue;
		const FTYPE q = dot(c,c);
		if (t >= q)
		{
			t = pow(length(b - point), 2);
			if (dist > t)
			{
				dist = t;
				index = i;
				isBone = false;
			}
		}
		else
		{
			t = vec(d,c);
			t = (t*t) / q;
			if (dist > t)
			{
				dist = t;
				index = i;
				isBone = true;
			}
		}
	}
	const int i = index > 0 ? index - 1 : nv - 1;
	const float2 &a = vert[i];
	const float2 &b = vert[index];
	if (isBone)
	{
		index = i;
		if (vec(b - a, b - point) < 0)
			dist = -dist;
	}
	else
	{
		const float2 &c = vert[index+1 < nv ? index+1 : 0];
		if (vec(b - a, c - b) < 0)
			dist = -dist;
	}
	return dist;
}
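For reference, the (t*t)/q terms above (the forum formatting tends to eat the multiplication sign) are the squared perpendicular distance to the segment's supporting line: t is the 2D cross product vec(d,c) and q the squared segment length. A small host-side sketch of just that term, with hypothetical helper names mirroring dot() and vec():

```cpp
#include <cmath>

struct P2 { float x, y; };

// Hypothetical helpers mirroring dot() and vec() from the post
// (vec() assumed to be the scalar 2D cross product).
inline float dot2(P2 a, P2 b)   { return a.x * b.x + a.y * b.y; }
inline float cross2(P2 a, P2 b) { return a.x * b.y - a.y * b.x; }

// Squared distance from p to the supporting line of segment a-b:
// t = cross2(d, c) is twice the signed area of triangle (a, b, p),
// q = |c|^2, so (t*t)/q is the squared perpendicular distance.
inline float lineDist2(P2 p, P2 a, P2 b)
{
    const P2 c = { b.x - a.x, b.y - a.y };
    const P2 d = { p.x - a.x, p.y - a.y };
    const float t = cross2(d, c);
    const float q = dot2(c, c);
    return (t * t) / q;
}
```

For example, lineDist2({0,2}, {-1,0}, {1,0}) gives 4, the squared distance of the point (0,2) from the x-axis.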

Regarding your last comment about shared memory, I completely agree with you, and of course I know that each thread reads the same data. However, from my point of view it was the only opportunity to reduce register usage and get rid of local memory. The purpose of my kernel is to compute minimal distances between all 3D points which belong to the 3D hex mesh and the mesh boundaries. It means that for each point I compute its distance to all boundary faces and then take the minimal one.
So I organized my kernel in such a manner that one thread is one point, and inside the kernel I have a loop over all boundary faces. It allowed me to have only one global memory write per thread at the end of the kernel (the minimal distance found for the given point is stored). However, as you rightly mentioned, all threads then have to load the same data about the given face from global memory.
I did it in terms of shared variables just to avoid using too many local variables (= reduce the number of registers) and to avoid local arrays like this
float3 pt[4];
So now I hope that I have clarified the situation and explained why I do it this way. However, the problem with __syncthreads() is still open and I really have no idea how to fix it. :(
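To make the structure concrete, here is a plain host-side sketch of the same scheme (one kernel thread = one point = one iteration of the outer loop; faceDist here is just a hypothetical stand-in for the real point-to-face distance computation):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct P3 { float x, y, z; };

// Stand-in for the real point-to-face distance (the actual kernel projects
// onto the face plane, tests the projection against the polygon, etc.).
inline float faceDist(const P3 &p, const P3 &f)
{
    const float dx = p.x - f.x, dy = p.y - f.y, dz = p.z - f.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// The inner loop is the kernel's j-loop with a running minimum,
// written out once per point at the end (the single global write).
std::vector<float> minDistances(const std::vector<P3> &points,
                                const std::vector<P3> &faces)
{
    std::vector<float> dist(points.size(), 1.e6f);
    for (std::size_t i = 0; i < points.size(); ++i)      // one thread
        for (std::size_t j = 0; j < faces.size(); ++j)   // the j-loop
            dist[i] = std::min(dist[i], faceDist(points[i], faces[j]));
    return dist;
}
```

For two points and two degenerate "faces", minDistances({{0,0,0},{2,0,0}}, {{1,0,0},{5,0,0}}) returns {1, 1}: each point keeps the minimum over both faces.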
Here is a link to a similar topic where MisterAnderson42 suggested that a large number of __syncthreads() calls can lead to a possible infinite loop.
http://forums.nvidia.com/index.php?showtopic=71401

OK, this was expected… All right…

As you indicated, the "tempDist2" code does not matter much…

Nope! You could always say:

if (threadIdx.x == 0)
{
	// read data into shared memory
}

__syncthreads();

Just my 2 cents. Much depends on how much memory you are reading…

You could also consider using more threads to load data into shared memory. For example, this is how I usually load data into shared memory… This avoids redundant loads and can also result in memory coalescing, depending on the amount of data you are loading…

/*
 * cacheDataInSharedMemory(sharedMemAddress, GlobalMemoryAddress, SizeInBytes):
 * 1. Assumes that the "size" argument is a multiple of "sizeof(int)"
 * 2. Assumes 32 threads per block. So, does NOT __syncthreads() at the end of the function.
 */
__device__ void cacheDataInSharedMemory(void *shared, void *opt, int size)
{
	int i;
	int *src = (int *) opt;
	int *dst = (int *) shared;

	for (i = threadIdx.x; i < (size/sizeof(int)); i += blockDim.x)
	{
		dst[i] = src[i];
	}
	return;
}
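For instance, in your kernel you could stage the four face points this way (a hypothetical caller sketch: d_face_pts is an invented name, and this assumes the face points are also reachable through a plain device pointer, since the helper reads global memory rather than textures):

```cuda
// Hypothetical caller: stage the four points of face j into shared memory
// cooperatively, then barrier once before every thread uses them.
__shared__ float3 s_pt[4];

cacheDataInSharedMemory(s_pt, d_face_pts + 4 * j, 4 * sizeof(float3));
__syncthreads();   // needed here, since your blocks are larger than one warp
```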

Also, I would suggest moving all the __shared__ declarations to the top of the function rather than declaring them in the middle.

Here is my guess (kinda wild) about the __syncthreads() issue:

I think some of your threads are exiting in the middle, owing to division by zero or something, resulting in a __syncthreads() hang!
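Something like this (a hypothetical minimal kernel, not your code) is the classic form of that hang: any thread that returns before the barrier leaves the rest of the block waiting forever.

```cuda
// Hypothetical sketch: the early-return threads never reach __syncthreads(),
// so the remaining threads of the block spin at the barrier until the
// watchdog kills the launch (cudaErrorLaunchTimeout).
__global__ void hangs(const float *in, float *out, int n)
{
	__shared__ float buf[256];
	const int i = blockIdx.x * blockDim.x + threadIdx.x;

	if (i >= n)
		return;             // some threads of the last block exit here...

	buf[threadIdx.x] = in[i];
	__syncthreads();        // ...and the rest wait for them forever

	out[i] = buf[threadIdx.x];
}
```

The safe form guards the loads and stores, not the barrier, so that every thread of the block stays alive through the __syncthreads().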

If possible, comment out operations that can result in INFINITY or non-deterministic output, like 0/0 or log(0), etc…

If nothing works, I suggest commenting out the code piece by piece and seeing which part actually breaks your kernel!! That's the only debug method I am aware of, and it always works for me!

Good Luck!