OK, I must admit I'm a bit frustrated: I can't get my head around this, and I've been going in circles for far too long.
My question is: how does a block divide its work among its threads? I understand this is a confusing question, so here is an example.
Normally I run my kernel like this:
kernel<<<100, 100>>>(numThreads);
__global__ void kernel(int numThreads)
{
    int bid = blockIdx.x;   // index of this block within the grid
    int tid = threadIdx.x;  // index of this thread within its block
    if (tid < numThreads)
        ...
}
OK, so here there are 100 blocks, each with 100 threads. Each block runs its threads, and a thread only does work while its index is within bounds. There are no loops such as for, so the threads should work simultaneously (in theory), without any dependencies, etc. This is the normal case.
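To make the normal case concrete, here is a minimal sketch of how I understand it. The names fillOnes and countFilled and the 9000-element size are made up for illustration, and error checking is left out for brevity. Each thread computes a grid-wide index from blockIdx.x, blockDim.x and threadIdx.x, and the bounds check lets the surplus threads fall through without doing anything.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread handles at most one element, selected by its global index.
__global__ void fillOnes(int *out, int numElements)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // unique across the whole grid
    if (gid < numElements)                            // surplus threads do nothing
        out[gid] = 1;
}

// Hypothetical host helper: launches the kernel and returns how many
// elements ended up set to 1.
int countFilled(int numElements, int blocks, int threadsPerBlock)
{
    int *dOut = nullptr;
    cudaMalloc(&dOut, numElements * sizeof(int));
    cudaMemset(dOut, 0, numElements * sizeof(int));

    fillOnes<<<blocks, threadsPerBlock>>>(dOut, numElements);
    cudaDeviceSynchronize();

    std::vector<int> h(numElements);
    cudaMemcpy(h.data(), dOut, numElements * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dOut);

    int count = 0;
    for (int v : h) count += v;
    return count;
}

int main()
{
    // 100 blocks * 100 threads = 10000 threads, but only 9000 elements to fill.
    printf("%d elements filled\n", countFilled(9000, 100, 100));
    return 0;
}
```

With 10000 threads and 9000 elements, the last 1000 threads fail the bounds check and simply return.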
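OK, so how about this?
kernel<<<100, 100>>>(numThreads);
__global__ void kernel(int numThreads)
{
    int bid = blockIdx.x;   // index of this block within the grid
    int tid = threadIdx.x;  // index of this thread within its block
    for (int i = 0; i < tid; i++)
        var[i] = 1; // I know, silly example, but... (var is assumed to be a device array declared elsewhere)
}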
Now, what about my threads? Will this whole loop be processed sequentially by a single thread? Do the threads do any work in parallel here at all? There are no dependencies in this loop, so will each element be processed by its own thread?
I just can't get a straight answer to this.
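One way to check this empirically is a small experiment like the sketch below. The names countIterations and eachThreadRunsOwnLoop are made up for illustration, and error checking is left out. Each thread counts the loop iterations it executes itself and writes the count to its own slot. As far as I understand the execution model, every thread runs the entire kernel body, so thread tid should report exactly tid iterations. If the loop were shared out between threads, the counts would not match.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Every thread executes this body, including the loop, on its own.
__global__ void countIterations(int *iterations)
{
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    int done = 0;                     // private to this thread
    for (int i = 0; i < tid; i++)
        done++;                       // iterations run one after another within the thread

    iterations[gid] = done;           // one slot per thread, so no write conflicts
}

// Hypothetical helper: true if every thread reported exactly threadIdx.x iterations.
bool eachThreadRunsOwnLoop(int blocks, int threadsPerBlock)
{
    int total = blocks * threadsPerBlock;
    int *dIter = nullptr;
    cudaMalloc(&dIter, total * sizeof(int));
    cudaMemset(dIter, 0xFF, total * sizeof(int));  // -1 everywhere, so unwritten slots fail the check

    countIterations<<<blocks, threadsPerBlock>>>(dIter);
    cudaDeviceSynchronize();

    std::vector<int> h(total);
    cudaMemcpy(h.data(), dIter, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIter);

    for (int gid = 0; gid < total; gid++)
        if (h[gid] != gid % threadsPerBlock)
            return false;
    return true;
}

int main()
{
    printf("each thread ran its own loop: %s\n",
           eachThreadRunsOwnLoop(100, 100) ? "yes" : "no");
    return 0;
}
```

If that reading is right, then in my example above many threads would write the same var[i] redundantly, and to get one element per thread the loop would have to be replaced by an index computed from blockIdx.x and threadIdx.x.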