threads violating for loop

hi

is the following for loop in a kernel valid?

for (l_int[0] = threadIdx.x; l_int[0] < s_int[0]; l_int[0] += blockDim.x)

s_int[0] is an int in shared memory, and in this case its value is 36. blockDim.x = 256
threads with threadIdx.x >= s_int[0] enter the loop too. for such a thread, the debugger shows threadIdx.x (the actual built-in value) as smaller than s_int[0], while the thread number in the debugger's own thread list is greater than s_int[0].

this still occurs when i change the for to a while loop

l_int[0] = threadIdx.x;

while (l_int[0] < s_int[0])
{
...
l_int[0]+= blockDim.x;
}

in the image, i have selected thread 38, which should not participate in the loop.
the debugger shows threadIdx.x as 32 when i place the mouse on it (see line 57).
again, s_int[0] shows as 36

l_int[0] is declared as: int l_int[6];

it works if i change l_int[0] to l_intx, and use l_intx instead of l_int[0] for the loop:

int l_intx;
int l_int[6];

why is l_int[0] (int l_int[6]) corrupting the loop? i take it to be an array of int in local memory, which is far easier to use than: int int_1, int_2, int_3, int_4, int_5

There’s nothing illegal about using an integer array:

int l_int[6];

whether for a loop variable or any other valid C++ usage.

If you want debugging help, my suggestion would be to provide a short, complete code example that someone else could compile and run to see the issue, without having to add or change anything. Since s_int is a shared variable, you may have some sort of race condition on it, for example.

My guess would be that in the process of creating such an example, you may discover the root cause of the problem yourself.
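As a rough idea of what such a short, complete test could look like, here is a minimal sketch. It assumes a single block of 256 threads and a bound of 36, as in the question. The kernel name stride_loop_check and the two result arrays are hypothetical and not from the original code. Each thread records the bound it read from shared memory and whether it entered the loop, so the result can be checked on the host without relying on the debugger display.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: mirrors the loop from the question and records
// what each thread saw, so the host can verify it.
__global__ void stride_loop_check(int* d_bound_seen, int* d_entered)
{
    __shared__ int s_int[1];

    if (threadIdx.x == 0)
    {
        s_int[0] = 36;   // loop bound used in the question
    }

    // Without this barrier, other threads could read s_int[0] before
    // thread 0 has written it (a race condition on the shared variable).
    __syncthreads();

    int l_int[6];   // array element used as the loop variable, as in the question

    d_bound_seen[threadIdx.x] = s_int[0];
    d_entered[threadIdx.x] = 0;

    for (l_int[0] = threadIdx.x; l_int[0] < s_int[0]; l_int[0] += blockDim.x)
    {
        d_entered[threadIdx.x] = 1;   // only threads 0..35 are expected here
    }
}

int main()
{
    const int n = 256;
    int* d_bound_seen;
    int* d_entered;
    int h_bound_seen[n];
    int h_entered[n];

    cudaMalloc(&d_bound_seen, n * sizeof(int));
    cudaMalloc(&d_entered, n * sizeof(int));

    stride_loop_check<<<1, n>>>(d_bound_seen, d_entered);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
    {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_bound_seen, d_bound_seen, sizeof(h_bound_seen), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_entered, d_entered, sizeof(h_entered), cudaMemcpyDeviceToHost);

    // Compare every thread against the expected behaviour.
    int bad = 0;
    for (int t = 0; t < n; t++)
    {
        int expected = (t < 36) ? 1 : 0;
        if (h_bound_seen[t] != 36 || h_entered[t] != expected)
        {
            printf("thread %d: bound=%d entered=%d\n", t, h_bound_seen[t], h_entered[t]);
            bad++;
        }
    }
    printf("%d mismatching threads\n", bad);

    cudaFree(d_bound_seen);
    cudaFree(d_entered);
    return bad != 0;
}
```

If this reports no mismatches while the debugger still appears to show thread 38 inside the loop, that would point at how the debugger presents the warp rather than at the loop itself. Keep in mind that threads 32 to 35 share a warp with thread 38 and do enter the loop.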

can threadIdx.x change/ be changed?
where is it stored? is it (not) stored as a constant?

it cannot be changed

Where it is stored is an implementation detail: when you access threadIdx.x in C/C++ device code, under the hood the value is loaded from a special register. The register value does not change, and you cannot write to it.
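If you want to convince yourself of this without the debugger, a small check along the following lines might help. It is only a sketch: the kernel name tid_is_stable and the arrays d_after and d_work are made up for illustration, and the launch of one block of 256 threads is assumed from the question. Each thread takes a private copy of threadIdx.x at entry, runs a loop like the one in the question, then reads threadIdx.x again and stores it in the slot selected by the earlier copy.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: checks that threadIdx.x reads the same value
// before and after a block-stride loop.
__global__ void tid_is_stable(unsigned int* d_after, int* d_work)
{
    // threadIdx.x = 0;   // expected to be rejected by the compiler: the built-in is read-only

    const unsigned int before = threadIdx.x;   // private copy taken at kernel entry
    int l_int[6];

    for (l_int[0] = threadIdx.x; l_int[0] < 36; l_int[0] += blockDim.x)
    {
        d_work[l_int[0]] = l_int[0] * 2;       // some work inside the loop
    }

    // Indexed by the copy taken at entry, so a changed threadIdx.x
    // would show up on the host as a mismatch in this slot.
    d_after[before] = threadIdx.x;
}

int main()
{
    const int n = 256;
    unsigned int* d_after;
    int* d_work;
    unsigned int h_after[n];

    cudaMalloc(&d_after, n * sizeof(unsigned int));
    cudaMalloc(&d_work, n * sizeof(int));
    cudaMemset(d_after, 0xFF, n * sizeof(unsigned int));   // sentinel 0xFFFFFFFF in every slot
    cudaMemset(d_work, 0, n * sizeof(int));

    tid_is_stable<<<1, n>>>(d_after, d_work);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
    {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_after, d_after, sizeof(h_after), cudaMemcpyDeviceToHost);

    // Slot t should hold exactly t if threadIdx.x never changed.
    int bad = 0;
    for (int t = 0; t < n; t++)
    {
        if (h_after[t] != (unsigned int)t)
        {
            printf("slot %d holds %u\n", t, h_after[t]);
            bad++;
        }
    }
    printf("%d slots differ from their thread index\n", bad);

    cudaFree(d_after);
    cudaFree(d_work);
    return bad != 0;
}
```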

i am still pulling the kernel apart and debugging. at first i thought i had made the kernel too complex. if i write a simple block, it works fine.

please see if you can reproduce the issue, if you don't mind. the code is below. i tried to make the kernel as simple as possible.

you will/ should note that the first

for (l_int[0] = threadIdx.x; l_int[0] < s_int[0]; l_int[0] += blockDim.x)

works fine. threads at or above the threshold (s_int[0]) skip the loop, as they should.
the 2nd for loop does not work as it should. the debugger shows threads above the threshold with their threadIdx.x changed to fit the threshold, and they then enter the for loop.

you can change:

__global__ void x_in_y_prob(int* d_int_data, int** d_int_ptr, double** d_dbl_ptr)

to

__global__ void x_in_y_prob(double* d_dbl_ptr)

and

d_dbl_ptr[0][l_int[1]] = l_dbl[0];

to

d_dbl_ptr[l_int[1]] = l_dbl[0];

then you only need to create and pass a double array of 256

kernel dimensions:

dim3 dGx(1, 1, 1);
dim3 dBx(256, 1, 1);

__global__ void x_in_y_prob(int* d_int_data, int** d_int_ptr, double** d_dbl_ptr)
{
	__shared__ int s_int[2];

	if (threadIdx.x == 0)
	{
		s_int[0] = 36;
		s_int[1] = 0;
	}

	__syncthreads();

	{
		int l_int[6];
		double l_dbl[1];

		for (l_int[0] = threadIdx.x; l_int[0] < s_int[0]; l_int[0] += blockDim.x)
		{
			l_int[1] = l_int[0] * 2;
			l_int[2] = l_int[1] * 5;
			l_int[3] = l_int[2] * 4;
			l_int[4] = l_int[3] * 6;
			l_int[5] = l_int[4] * 7;
			l_dbl[0] = l_int[5] / 2.4342;
			d_dbl_ptr[0][l_int[1]] = l_dbl[0];
		}
	}

	__syncthreads();

	{
		int l_int[6];
		double l_dbl[1];

		for (l_int[0] = threadIdx.x; l_int[0] < s_int[0]; l_int[0] += blockDim.x)
		{
			l_int[1] = s_int[1] + l_int[0];
			l_int[2] = threadIdx.x % 4;
			l_int[3] = threadIdx.x / 4;

			if (l_int[2] == l_int[3])
			{
				l_int[4] = 1;
				l_int[5] = 1;
			}

			else if (l_int[2] == 1)
			{
				l_int[4] = l_int[3];
				l_int[5] = 1;
			}

			else
			{
				l_int[4] = 1;
				l_int[5] = 1;
/*
				l_int[4] = get_x1s_in_ybits(l_int[2], l_int[3]);

				l_int[2]--;
				l_int[3]--;

				l_int[5] = get_x1s_in_ybits(l_int[2], l_int[3]);
*/
			}

			l_dbl[0] = l_int[4];
			l_dbl[0] = l_int[5] / l_dbl[0];
			d_dbl_ptr[0][l_int[1]] = l_dbl[0];
		}
	}
}

__device__ int get_x1s_in_ybits(int x1s, int ybits)
{
	int l_int[2];
	int res, ptr, count;
	int res_pnt[max_x_in_y_cnt];
	bool range_cnt[max_x_in_y_cnt];

	if ((x1s >= ybits) || ((x1s < 1) || (ybits < 1)))
	{
		l_int[0] = 0;

		if ((x1s == ybits) && (x1s > 0))
		{
			l_int[0] = 1;
		}

		return l_int[0];
	}

	range_cnt[0] = 1;
	res_pnt[0] = x1s;
	res = x1s;
	ptr = 0;
	count = 0;

	while (1)
	{
		ptr++;
		l_int[0] = ybits - ptr - res;

		if (l_int[0] < 1)
		{
			count++;
			l_int[0] = 1;
		}

		else
		{
			res_pnt[ptr] = res;
			range_cnt[ptr] = 1;
			l_int[0] = 0;
		}

		if (l_int[0] > 0)
		{
			l_int[0] = ptr - 1;

			while (l_int[0] >= 0)
			{
				l_int[1] = range_cnt[l_int[0]];

				if (l_int[1] > 0)
				{
					break;
				}

				l_int[0]--;
			}

			if (l_int[0] < 0)
			{
				break;
			}

			ptr = l_int[0];
			l_int[1] = res_pnt[ptr] - 1;
			res_pnt[ptr] = l_int[1];
			res = l_int[1];
			range_cnt[ptr] = 0;
		}
	}

	return count;
}

http://sscce.org/

“copy, paste, compile, and run”

If I can’t do that, I’m less likely to help.
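For illustration, a copy, paste, compile, and run version of the posted second loop might look roughly like the sketch below. It uses the simplified double* signature and the launch dimensions given in the post. The host code, the sentinel value of -1.0, and the final check are assumptions added here and are not part of the original. The commented-out get_x1s_in_ybits calls are left out because that function depends on max_x_in_y_cnt, which is not defined in the posted snippet.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Second loop from the posted kernel, with the simplified signature
// (double* instead of double**) suggested in the post.
__global__ void x_in_y_prob(double* d_dbl_ptr)
{
    __shared__ int s_int[2];

    if (threadIdx.x == 0)
    {
        s_int[0] = 36;
        s_int[1] = 0;
    }

    __syncthreads();

    int l_int[6];
    double l_dbl[1];

    for (l_int[0] = threadIdx.x; l_int[0] < s_int[0]; l_int[0] += blockDim.x)
    {
        l_int[1] = s_int[1] + l_int[0];
        l_int[2] = threadIdx.x % 4;
        l_int[3] = threadIdx.x / 4;

        if (l_int[2] == l_int[3])
        {
            l_int[4] = 1;
            l_int[5] = 1;
        }
        else if (l_int[2] == 1)
        {
            l_int[4] = l_int[3];
            l_int[5] = 1;
        }
        else
        {
            l_int[4] = 1;
            l_int[5] = 1;
        }

        l_dbl[0] = l_int[4];
        l_dbl[0] = l_int[5] / l_dbl[0];
        d_dbl_ptr[l_int[1]] = l_dbl[0];
    }
}

int main()
{
    const int n = 256;
    const double sentinel = -1.0;   // assumed marker: the kernel never writes a negative value
    double h_dbl[n];
    double* d_dbl;

    for (int i = 0; i < n; i++)
    {
        h_dbl[i] = sentinel;
    }

    cudaMalloc(&d_dbl, n * sizeof(double));
    cudaMemcpy(d_dbl, h_dbl, sizeof(h_dbl), cudaMemcpyHostToDevice);

    dim3 dGx(1, 1, 1);
    dim3 dBx(256, 1, 1);
    x_in_y_prob<<<dGx, dBx>>>(d_dbl);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
    {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_dbl, d_dbl, sizeof(h_dbl), cudaMemcpyDeviceToHost);

    // Threads 36..255 should skip the loop, so their slots keep the sentinel.
    int stray = 0;
    for (int i = 36; i < n; i++)
    {
        if (h_dbl[i] != sentinel)
        {
            stray++;
        }
    }
    printf("%d slots at index >= 36 were written\n", stray);

    cudaFree(d_dbl);
    return stray != 0;
}
```

Since s_int[1] is 0, each thread that enters the loop writes to the slot matching its own index, so any write at index 36 or above would mean a thread past the bound really did execute the loop body.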

let me break up the kernel and simplify it, so that it is easier to debug.