threads violating for loop

little_jimmy · June 11, 2019, 9:24am

hi

is the following for loop in a kernel valid?

for (l_int[0] = threadIdx.x; l_int[0] < s_int[0]; l_int[0] += blockDim.x)

s_int[0] is in shared memory (int), and in this case its value is 36. blockDim.x = 256
threads with threadIdx.x >= s_int[0] enter the loop too. for those threads the debugger shows threadIdx.x (the actual threadIdx.x) as smaller than s_int[0], while the debugger's own thread number (its number for threadIdx.x) is greater than s_int[0].

this still occurs when i change the for to a while loop

l_int[0] = threadIdx.x;

while (l_int[0] < s_int[0])
{
...
l_int[0]+= blockDim.x;
}

in the image, i have selected thread 38, which should not participate in the loop.
the debugger shows threadIdx.x as 32 when i place the mouse on it (see line 57).
again, s_int[0] shows as 36

l_int[0] is declared as: int l_int[6];

little_jimmy · June 11, 2019, 9:32am

it works if i change l_int[0] to l_intx, and use l_intx instead of l_int[0] for the loop:

int l_intx;
int l_int[6];

why is l_int[0] (declared as int l_int[6]) corrupting the loop? i take it to be an array of int in local memory, which is far easier to use than separate variables: int int_1, int_2, int_3, int_4, int_5

Robert_Crovella · June 11, 2019, 2:52pm

There’s nothing illegal about using an integer array:

int l_int[6];

whether for a loop variable or any other valid C++ usage.

If you want debugging help, my suggestion would be to provide a short, complete code example that someone else could compile and run to see the issue, without having to add or change anything. Since s_int is a shared variable, you may have some sort of race condition on it, for example.

My guess would be that in the process of creating such example, you may discover the root cause of the problem yourself.
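As an illustration of the point about array elements as loop variables, here is a minimal sketch of a complete test. It is not code from this thread: the names count_iterations and check_loop are hypothetical, and the bound is passed as a kernel argument rather than read from shared memory so that no race on s_int can be involved. Each thread counts its iterations of the same block-stride loop, once with l_int[0] and once with a scalar, and the host compares both counts with the expected value.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread counts its iterations of the same
// block-stride loop twice, once driven by an array element (as in the
// question) and once by a plain scalar.
__global__ void count_iterations(int bound, int* array_counts, int* scalar_counts)
{
    int l_int[6];   // same declaration as in the question
    int l_intx;     // scalar alternative from the follow-up post

    int n_array = 0;
    for (l_int[0] = threadIdx.x; l_int[0] < bound; l_int[0] += blockDim.x)
        ++n_array;

    int n_scalar = 0;
    for (l_intx = threadIdx.x; l_intx < bound; l_intx += blockDim.x)
        ++n_scalar;

    array_counts[threadIdx.x]  = n_array;
    scalar_counts[threadIdx.x] = n_scalar;
}

// Runs one block of 256 threads and returns how many threads reported an
// iteration count different from the expected one (-1 on a CUDA error).
int check_loop(int bound)
{
    const int threads = 256;
    int h_array[threads], h_scalar[threads];
    int *d_array = nullptr, *d_scalar = nullptr;

    if (cudaMalloc(&d_array, sizeof(h_array)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_scalar, sizeof(h_scalar)) != cudaSuccess) { cudaFree(d_array); return -1; }

    count_iterations<<<1, threads>>>(bound, d_array, d_scalar);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaMemcpy(h_array, d_array, sizeof(h_array), cudaMemcpyDeviceToHost);
    if (err == cudaSuccess) err = cudaMemcpy(h_scalar, d_scalar, sizeof(h_scalar), cudaMemcpyDeviceToHost);
    cudaFree(d_array);
    cudaFree(d_scalar);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int t = 0; t < threads; ++t) {
        // Thread t visits t, t + 256, t + 512, ... while the value is below bound.
        int expected = (t < bound) ? (bound - t + threads - 1) / threads : 0;
        if (h_array[t] != expected || h_scalar[t] != expected) ++mismatches;
    }
    return mismatches;
}

int main()
{
    printf("mismatching threads for bound 36: %d\n", check_loop(36));
    return 0;
}
```

If both loop forms are equivalent, as the language rules say they should be, the reported mismatch count should be 0. A nonzero count would point at something other than the choice of loop variable.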

little_jimmy · June 11, 2019, 6:43pm

can threadIdx.x change/ be changed?
where is it stored? is it (not) stored as a constant?

Robert_Crovella · June 11, 2019, 6:54pm

It cannot be changed.

As an implementation detail, when you access threadIdx.x in C/C++ device code, under the hood the value is loaded from a special register. The register value does not change, and you cannot write to it.
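A small sketch of what this means in practice, assuming a single block of 256 threads (the names check_thread_index and count_changed are hypothetical and do not come from this thread): the loop modifies only a local copy of threadIdx.x, and the kernel then checks that the built-in still reads the same value.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: advances a local copy of threadIdx.x through a
// block-stride loop, then checks that the built-in still reads the same.
__global__ void check_thread_index(int bound, int* changed)
{
    const unsigned int before = threadIdx.x;  // value read from the built-in

    int i = threadIdx.x;                      // ordinary local copy
    while (i < bound)
        i += blockDim.x;                      // only the copy is modified

    // threadIdx.x = 0;  // writing to the built-in is expected to be rejected by the compiler

    changed[before] = (threadIdx.x != before) ? 1 : 0;
}

// Returns the number of threads that saw threadIdx.x change (-1 on a CUDA error).
int count_changed(int bound)
{
    const int threads = 256;
    int h_changed[threads];
    int* d_changed = nullptr;

    if (cudaMalloc(&d_changed, sizeof(h_changed)) != cudaSuccess) return -1;
    check_thread_index<<<1, threads>>>(bound, d_changed);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaMemcpy(h_changed, d_changed, sizeof(h_changed), cudaMemcpyDeviceToHost);
    cudaFree(d_changed);
    if (err != cudaSuccess) return -1;

    int total = 0;
    for (int t = 0; t < threads; ++t) total += h_changed[t];
    return total;
}

int main()
{
    printf("threads that saw threadIdx.x change: %d\n", count_changed(36));
    return 0;
}
```

Since the loop variable is a separate object, incrementing it cannot affect what threadIdx.x returns, so the count should come back as 0.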

little_jimmy · June 11, 2019, 7:22pm

i am still pulling the code apart and debugging. at first i thought i had written the kernel too complex; if i write a simple block, it works fine.

please see if you can reproduce the issue, if you don't mind. the code is below. i tried to make the kernel as simple as possible.

you should see that the first

for (l_int[0] = threadIdx.x; l_int[0] < s_int[0]; l_int[0] += blockDim.x)

works fine: threads at or above the threshold (s_int[0]) skip the loop, as they should.
the 2nd for loop does not work as it should: the debugger shows threads above the threshold changing their threadIdx.x to fit within the threshold, and they then enter the for loop.

you can change:

__global__ void x_in_y_prob(int* d_int_data, int** d_int_ptr, double** d_dbl_ptr)

to

__global__ void x_in_y_prob(double* d_dbl_ptr)

and

d_dbl_ptr[0][l_int[1]] = l_dbl[0];

to

d_dbl_ptr[l_int[1]] = l_dbl[0];

then you only need to create and pass an array of 256 doubles

kernel dimensions:

dim3 dGx(1, 1, 1);
dim3 dBx(256, 1, 1);

__global__ void x_in_y_prob(int* d_int_data, int** d_int_ptr, double** d_dbl_ptr)
{
	__shared__ int s_int[2];

	if (threadIdx.x == 0)
	{
		s_int[0] = 36;
		s_int[1] = 0;
	}

	__syncthreads();

	{
		int l_int[6];
		double l_dbl[1];

		for (l_int[0] = threadIdx.x; l_int[0] < s_int[0]; l_int[0] += blockDim.x)
		{
			l_int[1] = l_int[0] * 2;
			l_int[2] = l_int[1] * 5;
			l_int[3] = l_int[2] * 4;
			l_int[4] = l_int[3] * 6;
			l_int[5] = l_int[4] * 7;
			l_dbl[0] = l_int[5] / 2.4342;
			d_dbl_ptr[0][l_int[1]] = l_dbl[0];
		}
	}

	__syncthreads();

	{
		int l_int[6];
		double l_dbl[1];

		for (l_int[0] = threadIdx.x; l_int[0] < s_int[0]; l_int[0] += blockDim.x)
		{
			l_int[1] = s_int[1] + l_int[0];
			l_int[2] = threadIdx.x % 4;
			l_int[3] = threadIdx.x / 4;

			if (l_int[2] == l_int[3])
			{
				l_int[4] = 1;
				l_int[5] = 1;
			}

			else if (l_int[2] == 1)
			{
				l_int[4] = l_int[3];
				l_int[5] = 1;
			}

			else
			{
				l_int[4] = 1;
				l_int[5] = 1;
/*
				l_int[4] = get_x1s_in_ybits(l_int[2], l_int[3]);

				l_int[2]--;
				l_int[3]--;

				l_int[5] = get_x1s_in_ybits(l_int[2], l_int[3]);
*/
			}

			l_dbl[0] = l_int[4];
			l_dbl[0] = l_int[5] / l_dbl[0];
			d_dbl_ptr[0][l_int[1]] = l_dbl[0];
		}
	}
}

__device__ int get_x1s_in_ybits(int x1s, int ybits)
{
	int l_int[2];
	int res, ptr, count;
	int res_pnt[max_x_in_y_cnt];
	bool range_cnt[max_x_in_y_cnt];

	if ((x1s >= ybits) || ((x1s < 1) || (ybits < 1)))
	{
		l_int[0] = 0;

		if ((x1s == ybits) && (x1s > 0))
		{
			l_int[0] = 1;
		}

		return l_int[0];
	}

	range_cnt[0] = 1;
	res_pnt[0] = x1s;
	res = x1s;
	ptr = 0;
	count = 0;

	while (1)
	{
		ptr++;
		l_int[0] = ybits - ptr - res;

		if (l_int[0] < 1)
		{
			count++;
			l_int[0] = 1;
		}

		else
		{
			res_pnt[ptr] = res;
			range_cnt[ptr] = 1;
			l_int[0] = 0;
		}

		if (l_int[0] > 0)
		{
			l_int[0] = ptr - 1;

			while (l_int[0] >= 0)
			{
				l_int[1] = range_cnt[l_int[0]];

				if (l_int[1] > 0)
				{
					break;
				}

				l_int[0]--;
			}

			if (l_int[0] < 0)
			{
				break;
			}

			ptr = l_int[0];
			l_int[1] = res_pnt[ptr] - 1;
			res_pnt[ptr] = l_int[1];
			res = l_int[1];
			range_cnt[ptr] = 0;
		}
	}

	return count;
}

Robert_Crovella · June 11, 2019, 7:35pm

“copy, paste, compile, and run”

If I can’t do that, I’m less likely to help.
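For reference, a sketch of what a "copy, paste, compile, and run" version of the posted kernel could look like, using the simplified signature the post itself describes. Two things are assumptions and not part of the original code: the d_entered argument, which records which threads ran the body of the second loop, and the host function count_unexpected_threads. The unused get_x1s_in_ybits helper and its commented-out calls are left out.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel from the post, with the simplified signature described there.
// d_entered is a hypothetical extra argument, not part of the original code.
__global__ void x_in_y_prob(double* d_dbl_ptr, int* d_entered)
{
    __shared__ int s_int[2];

    if (threadIdx.x == 0)
    {
        s_int[0] = 36;
        s_int[1] = 0;
    }

    __syncthreads();

    {
        int l_int[6];
        double l_dbl[1];

        for (l_int[0] = threadIdx.x; l_int[0] < s_int[0]; l_int[0] += blockDim.x)
        {
            l_int[1] = l_int[0] * 2;
            l_int[2] = l_int[1] * 5;
            l_int[3] = l_int[2] * 4;
            l_int[4] = l_int[3] * 6;
            l_int[5] = l_int[4] * 7;
            l_dbl[0] = l_int[5] / 2.4342;
            d_dbl_ptr[l_int[1]] = l_dbl[0];
        }
    }

    __syncthreads();

    {
        int l_int[6];
        double l_dbl[1];

        for (l_int[0] = threadIdx.x; l_int[0] < s_int[0]; l_int[0] += blockDim.x)
        {
            d_entered[threadIdx.x] = 1;  // added: record that this thread ran the body

            l_int[1] = s_int[1] + l_int[0];
            l_int[2] = threadIdx.x % 4;
            l_int[3] = threadIdx.x / 4;

            if (l_int[2] == l_int[3]) { l_int[4] = 1;        l_int[5] = 1; }
            else if (l_int[2] == 1)   { l_int[4] = l_int[3]; l_int[5] = 1; }
            else                      { l_int[4] = 1;        l_int[5] = 1; }

            l_dbl[0] = l_int[4];
            l_dbl[0] = l_int[5] / l_dbl[0];  // for thread 1, l_int[4] is 0, so this is 1 / 0.0 (inf)
            d_dbl_ptr[l_int[1]] = l_dbl[0];
        }
    }
}

// Returns how many threads with threadIdx.x >= 36 ran the second loop body
// (-1 on a CUDA error).
int count_unexpected_threads()
{
    const int threads = 256;
    int h_entered[threads];
    double* d_dbl = nullptr;
    int* d_entered = nullptr;

    if (cudaMalloc(&d_dbl, threads * sizeof(double)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_entered, sizeof(h_entered)) != cudaSuccess) { cudaFree(d_dbl); return -1; }
    cudaMemset(d_entered, 0, sizeof(h_entered));

    dim3 dGx(1, 1, 1);
    dim3 dBx(threads, 1, 1);
    x_in_y_prob<<<dGx, dBx>>>(d_dbl, d_entered);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaMemcpy(h_entered, d_entered, sizeof(h_entered), cudaMemcpyDeviceToHost);
    cudaFree(d_dbl);
    cudaFree(d_entered);
    if (err != cudaSuccess) return -1;

    int unexpected = 0;
    for (int t = 36; t < threads; ++t) unexpected += h_entered[t];
    return unexpected;
}

int main()
{
    printf("threads >= 36 that ran the second loop: %d\n", count_unexpected_threads());
    return 0;
}
```

Because the result is read back on the host, it does not depend on how the debugger presents individual threads. One thing that may be worth keeping in mind when reading the debugger view: threads 32 to 63 belong to the same warp, so threads 36 to 63 are inactive in that warp while threads 32 to 35 run the loop body. Whether that explains the display described above is not established in this thread.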

little_jimmy · June 11, 2019, 8:18pm

let me break up the kernel and simplify it, so that it is easier to debug.
