Need a little help understanding how threads change/work

OK, I'm new to CUDA, and although I've read many tutorials, I still can't get a clear picture of how threads change during kernel execution. Here is a small example that is giving me a big problem:

__global__ void kernelInit(int *tab, int row)
{
	int tid = threadIdx.x;
	int bid = blockIdx.x;

	if(tid < row)
		tab[bid * row + tid] = 1;
}

__global__ void kernelAdd(int *tab, int *result, int row)
{
	int tid = threadIdx.x;
	int bid = blockIdx.x;

	if(tid < row)
	{
		result[bid] += tab[bid*row + tid];
	}
}

In this example, tab is a linear array of 5x5 elements (col*row) that imitates a square array. I launch the kernel so that the number of blocks equals the number of rows, and each block runs a number of threads equal to the number of columns.

As far as I understand, the first kernel should put a “1” in each element of tab, and so it does.

The second kernel (kernelAdd) should sum the elements of each row and add the total to result at the index of the block that processes that row.

OK, so in this example, where we imitate a 5x5 matrix, each element equals 1; that part is correct.

But after execution, I get my result table as follows:

1 1 1 1 1, which by my calculations should be 5 5 5 5 5.

As far as I understand, it behaves as if only the last thread in each block ran, or as if the thread doesn't change during execution of the block.

So if in the first kernel I put something like this:

if(tid < row)
	tab[bid * row + tid] = tid;

my correct result should be:

0 1 2 3 4
0 1 2 3 4
0 1 2 3 4
0 1 2 3 4
0 1 2 3 4

10 10 10 10 10

and in my case it is:

0 1 2 3 4
0 1 2 3 4
0 1 2 3 4
0 1 2 3 4
0 1 2 3 4

4 4 4 4 4

And here is my question: what am I doing wrong, and how do I fix it? Obviously I just don't understand something about thread/block execution.

Please help me.

Hi,

Do you know the term ‘race condition’? In the code sample above, every thread in a block tries to update the same memory location, ‘result[bid]’. There is no way to say which of them does it first or which does it last. To solve this I would use atomic operations (read about them in the CUDA C Programming Guide, appendix B, part 11).
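
To make the race concrete: `result[bid] += ...` is really a read, an add, and a write. The five threads of a block sit in the same warp, so they can all read the old value before any of them writes, and then only one write survives. That would explain a 1 (or a 4) instead of the sum; CUDA does not define which thread's write wins. Below is a minimal sketch of the atomic version, assuming result is zeroed before the launch. The names `kernelAddAtomic` and `sumRowsAtomic` are made up for illustration, and error checking is left out.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Same indexing as kernelAdd, but the update of result[bid] is atomic,
// so no thread's contribution is lost.
__global__ void kernelAddAtomic(const int *tab, int *result, int row)
{
	int tid = threadIdx.x;
	int bid = blockIdx.x;

	if (tid < row)
		atomicAdd(&result[bid], tab[bid * row + tid]);
}

// Hypothetical host helper (not from the original post).
// hostTab holds `rows` rows of `row` elements each, stored row by row.
std::vector<int> sumRowsAtomic(const std::vector<int> &hostTab, int rows, int row)
{
	int *tab = nullptr;
	int *result = nullptr;
	size_t tabBytes = hostTab.size() * sizeof(int);
	size_t resultBytes = rows * sizeof(int);

	cudaMalloc(&tab, tabBytes);
	cudaMalloc(&result, resultBytes);
	cudaMemcpy(tab, hostTab.data(), tabBytes, cudaMemcpyHostToDevice);
	cudaMemset(result, 0, resultBytes);  // the sums must start from zero

	// One block per row, one thread per column, as in the question.
	kernelAddAtomic<<<rows, row>>>(tab, result, row);

	std::vector<int> hostResult(rows);
	cudaMemcpy(hostResult.data(), result, resultBytes, cudaMemcpyDeviceToHost);

	cudaFree(tab);
	cudaFree(result);
	return hostResult;
}
```

With the 5x5 table of ones from the question, `sumRowsAtomic(tab, 5, 5)` should give 5 for every row.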

Regards,

MK

The results will depend on the run. You should check the book CUDA by Example; it has a chapter on how to do this optimally. Using atomic operations will serialize your thread execution and result in a big decrease in performance.
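
One common way to avoid atomics here is a per-block reduction in shared memory, where the threads of a block combine their values pairwise and a single thread writes the row sum. What follows is only a rough sketch of that pattern, not code taken from the book. It assumes each block is launched with exactly `BLOCK` threads, a power of two no smaller than the row length; `BLOCK`, `kernelAddReduce` and `sumRowsReduce` are illustrative names, and error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Assumption: threads per block, a power of two >= row length
// (8 is enough for the 5-column example).
constexpr int BLOCK = 8;

__global__ void kernelAddReduce(const int *tab, int *result, int row)
{
	__shared__ int partial[BLOCK];

	int tid = threadIdx.x;
	int bid = blockIdx.x;

	// Each thread loads one element; padding threads contribute 0.
	partial[tid] = (tid < row) ? tab[bid * row + tid] : 0;
	__syncthreads();

	// Tree reduction: halve the number of active threads at each step.
	for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
		if (tid < stride)
			partial[tid] += partial[tid + stride];
		__syncthreads();
	}

	// Only one thread writes, so there is no race on result[bid].
	if (tid == 0)
		result[bid] = partial[0];
}

// Hypothetical host helper; returns an empty vector if a row does not fit.
std::vector<int> sumRowsReduce(const std::vector<int> &hostTab, int rows, int row)
{
	if (row > BLOCK)
		return {};

	int *tab = nullptr;
	int *result = nullptr;
	size_t tabBytes = hostTab.size() * sizeof(int);
	size_t resultBytes = rows * sizeof(int);

	cudaMalloc(&tab, tabBytes);
	cudaMalloc(&result, resultBytes);
	cudaMemcpy(tab, hostTab.data(), tabBytes, cudaMemcpyHostToDevice);

	kernelAddReduce<<<rows, BLOCK>>>(tab, result, row);  // one block per row

	std::vector<int> hostResult(rows);
	cudaMemcpy(hostResult.data(), result, resultBytes, cudaMemcpyDeviceToHost);

	cudaFree(tab);
	cudaFree(result);
	return hostResult;
}
```

For rows as short as 5 elements the difference from the atomic version is unlikely to be noticeable; the reduction is mainly worth it when many threads would otherwise hit the same address.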

Well, but this is a summation. There are no dependencies between any of the cells in a row, so it shouldn't matter, unless all the threads execute at exactly the same time. But even that would not explain why it always puts the value from the last column in the result.

I was reading briefly about atomic operations, and my goal is to optimize the code as much as possible, so I would like to avoid them.

OK, I got it working with atomics, thanks for the advice :)