OK, I'm new to CUDA, and although I've read many tutorials, I still can't get a clear picture of how threads change during kernel execution. Here is a small example that I have a big problem with:
__global__ void kernelInit(int *tab, int row)
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    if (tid < row)
        tab[bid * row + tid] = 1;
}
__global__ void kernelAdd(int *tab, int *result, int row)
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    if (tid < row)
    {
        result[bid] += tab[bid * row + tid];
    }
}
In this example tab is a linear array of 5x5 elements (col*row) that imitates a square matrix. I launch each kernel so that
the number of blocks equals the number of rows, and each block runs a number of threads equal to the number of columns.
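The host code is not shown in the post, so here is only a minimal sketch of the launch configuration described above, assuming a 5x5 array, one block per row, one thread per column, and a result array zeroed before the sum. The helper runKernels and the device pointer names are hypothetical, and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Kernels copied from the post (only indentation added).
__global__ void kernelInit(int *tab, int row)
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    if (tid < row)
        tab[bid * row + tid] = 1;
}

__global__ void kernelAdd(int *tab, int *result, int row)
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    if (tid < row)
    {
        result[bid] += tab[bid * row + tid];
    }
}

// Hypothetical helper (not from the post): runs both kernels on an n x n
// array and copies tab and result back to the host.
void runKernels(int n, std::vector<int> &tab, std::vector<int> &result)
{
    int *dTab = nullptr;
    int *dResult = nullptr;
    cudaMalloc((void **)&dTab, n * n * sizeof(int));
    cudaMalloc((void **)&dResult, n * sizeof(int));
    cudaMemset(dResult, 0, n * sizeof(int)); // assumption: result starts at 0

    // One block per row, one thread per column, as described in the post.
    kernelInit<<<n, n>>>(dTab, n);
    kernelAdd<<<n, n>>>(dTab, dResult, n);
    cudaDeviceSynchronize();

    tab.resize(n * n);
    result.resize(n);
    cudaMemcpy(tab.data(), dTab, n * n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(result.data(), dResult, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dTab);
    cudaFree(dResult);
}

int main()
{
    const int n = 5;
    std::vector<int> tab, result;
    runKernels(n, tab, result);

    // Print the matrix row by row, then the per-row result.
    for (int r = 0; r < n; ++r)
    {
        for (int c = 0; c < n; ++c)
            printf("%d ", tab[r * n + c]);
        printf("\n");
    }
    for (int r = 0; r < n; ++r)
        printf("%d ", result[r]);
    printf("\n");
    return 0;
}
```

With this configuration blockIdx.x selects the row and threadIdx.x selects the column, so every thread touches a different element of tab, while all threads of one block touch the same element of result.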
As far as I understand, the first kernel should put a “1” in every element of tab, and it does.
The second kernel (kernelAdd) should sum the elements of each row and add them to result at the index of the block that processes that row.
So in this example, where we imitate a 5x5 matrix, every element equals 1, which is correct.
But after execution my result array is as follows:
1 1 1 1 1, whereas by my calculations it should be 5 5 5 5 5.
As far as I understand, it behaves as if only the last thread of each block ran, or as if the thread did not change during execution of the block.
So if in the first kernel I put something like this:
    if (tid < row)
        tab[bid * row + tid] = tid;
the correct output (the matrix, followed by the result array on the last line) should be:
0 1 2 3 4
0 1 2 3 4
0 1 2 3 4
0 1 2 3 4
0 1 2 3 4
10 10 10 10 10
and in my case it is:
0 1 2 3 4
0 1 2 3 4
0 1 2 3 4
0 1 2 3 4
0 1 2 3 4
4 4 4 4 4
And here is my question: what am I doing wrong, and how do I fix this? Obviously I just don't understand something about thread/block execution.
Please help me.
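One explanation worth checking: result[bid] += ... is a read-modify-write that every thread of a block performs on the same element without any synchronization. If all threads read the initial 0 and only one write survives, the outputs 1 and 4 above are exactly what you would see. Below is a minimal sketch of a variant that uses atomicAdd for the update, assuming this race is the cause. The names kernelAddAtomic and rowSums are hypothetical and not from the post, and error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical variant of kernelAdd: each thread adds its element to
// result[bid] with atomicAdd, so concurrent updates are not lost.
__global__ void kernelAddAtomic(const int *tab, int *result, int row)
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    if (tid < row)
        atomicAdd(&result[bid], tab[bid * row + tid]);
}

// Hypothetical helper: sums each row of an n x n host array on the GPU,
// using one block per row and one thread per column.
std::vector<int> rowSums(const std::vector<int> &tab, int n)
{
    int *dTab = nullptr;
    int *dResult = nullptr;
    cudaMalloc((void **)&dTab, n * n * sizeof(int));
    cudaMalloc((void **)&dResult, n * sizeof(int));
    cudaMemcpy(dTab, tab.data(), n * n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dResult, 0, n * sizeof(int)); // sums must start at 0

    kernelAddAtomic<<<n, n>>>(dTab, dResult, n);
    cudaDeviceSynchronize();

    std::vector<int> result(n);
    cudaMemcpy(result.data(), dResult, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dTab);
    cudaFree(dResult);
    return result;
}

int main()
{
    const int n = 5;
    // Same data as the second experiment: every row is 0 1 2 3 4.
    std::vector<int> tab(n * n);
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c)
            tab[r * n + c] = c;

    std::vector<int> result = rowSums(tab, n);
    for (int r = 0; r < n; ++r)
        printf("%d ", result[r]); // each row sum is 0+1+2+3+4 = 10
    printf("\n");
    return 0;
}
```

Atomics are only one option. For longer rows a shared-memory reduction, or a single thread per row looping over the columns, may be a better fit.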