Are loop incrementations performed by all threads?

I may have missed this in the Manual (or in previous discussion here)… it seems like an important issue.

If I have a for loop on the device, e.g.:

__global__ void kernel(){
    int k;

    for(k=0;k<5;k++){
    }
}

I would imagine that this would cause a problem, as all of the threads within the block would try to increment k. Or does the compiler do something clever when it sees something like this?

Related to this is something else that I have often done in my code, but I would just like to confirm that it is necessary (and that it is not a silly way to do it):

If I have something in the kernel that doesn’t need to be done in parallel by all of the threads in the block, eg. if I am changing what some pointer points to, then I usually say:

if(threadIdx.x==0){
p1=p2;
}

Thanks

You missed that in the manual: variables that are not declared as __shared__, __constant__, etc. are not shared between threads. So every thread would do exactly 5 iterations.

About the second question: doing that in all threads will probably be more efficient, as long as you don't have a large number of bank conflicts. First, you will have no ifs there. Second, the operations inside the if are executed for all threads anyway, but some just discard the result (in the example, threads with threadIdx.x != 0 would throw away their results).

Your variable int k is local to each thread in a register. There will be no race conditions as each thread will be incrementing a separate register. This is true of any variable declared normally in the kernel.

Your worries about race conditions only apply if you declare the variable __shared__, as it will then be the same variable across the entire block.
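To make the point above concrete, here is a minimal sketch in which every thread runs the same 5-iteration loop on its own private k and writes its iteration count to global memory. The kernel name countIterations, the output array, and the launch dimensions are made up for illustration; the host code only checks that each thread counted to 5.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread loops with its own private k.
__global__ void countIterations(int *out)
{
    int k;          // private to each thread (normally held in a register)
    int count = 0;  // also private, so no thread sees another thread's value

    for (k = 0; k < 5; k++) {
        count++;
    }

    // One output slot per thread, so there is no write conflict either.
    out[blockIdx.x * blockDim.x + threadIdx.x] = count;
}

int main()
{
    const int threads = 64, blocks = 2, n = threads * blocks;
    int h_out[n];
    int *d_out = nullptr;

    cudaError_t err = cudaMalloc(&d_out, n * sizeof(int));
    assert(err == cudaSuccess);

    countIterations<<<blocks, threads>>>(d_out);

    // cudaMemcpy waits for the kernel to finish before copying.
    err = cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);

    for (int i = 0; i < n; i++) {
        assert(h_out[i] == 5);  // every thread did exactly 5 iterations
    }
    printf("all %d threads ran 5 iterations\n", n);

    cudaFree(d_out);
    return 0;
}
```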

If I have something in the kernel that doesn't need to be done in parallel by all of the threads in the block, eg. if I am changing what some pointer points to, then I usually say:

if(threadIdx.x==0){
p1=p2;
}

That’s OK if p1 is a shared variable, but don’t forget the __syncthreads()!!! Otherwise some threads might continue past this point before thread 0 makes the assignment.
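A minimal sketch of that pattern, assuming p1 is a __shared__ pointer that thread 0 sets and the rest of the block then reads. The kernel name useSharedPointer, the useB flag, and the buffers a and b are hypothetical and exist only to give the pointer something to select.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: thread 0 picks which buffer the whole block reads from.
__global__ void useSharedPointer(const float *a, const float *b, int useB, float *out)
{
    __shared__ const float *p1;  // one copy of the pointer for the whole block

    if (threadIdx.x == 0) {
        p1 = useB ? b : a;       // only thread 0 performs the assignment
    }
    __syncthreads();             // nobody reads p1 until thread 0 has written it

    out[threadIdx.x] = p1[threadIdx.x];
}

int main()
{
    const int n = 64;
    float h_a[n], h_b[n], h_out[n];
    for (int i = 0; i < n; i++) { h_a[i] = (float)i; h_b[i] = -(float)i; }

    float *d_a, *d_b, *d_out;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);

    useSharedPointer<<<1, n>>>(d_a, d_b, 1, d_out);  // single block, select b

    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    for (int i = 0; i < n; i++) assert(h_out[i] == h_b[i]);
    printf("all threads read through the pointer set by thread 0\n");

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    return 0;
}
```

Without the __syncthreads(), threads in other warps could dereference p1 before thread 0 has stored to it.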

In simple situations like updating pointers, it is usually simpler and can be faster just to have every thread do it in its own register and avoid the __syncthreads(). That way, all threads are running independently and aren’t stalled waiting for the others on every pointer update.
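As a rough illustration of the per-thread alternative, the sketch below has every thread keep its own src and dst pointers and swap them on each step. The kernel name pingPong and the two buffers are assumptions for the example. No barrier is used here only because each thread touches just its own element; if threads read elements written by other threads, a __syncthreads() between steps would still be needed for the data itself.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread swaps its own private copies of the pointers.
__global__ void pingPong(float *bufA, float *bufB, int steps)
{
    float *src = bufA;   // per-thread copies, normally held in registers
    float *dst = bufB;
    int i = threadIdx.x;

    for (int s = 0; s < steps; s++) {
        dst[i] = src[i] + 1.0f;  // each thread works only on element i

        // Every thread performs the same swap on its own registers,
        // so no thread has to wait for another one to update a pointer.
        float *tmp = src;
        src = dst;
        dst = tmp;
    }
}

int main()
{
    const int n = 64, steps = 3;
    float h_b[n];
    float *d_a, *d_b;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMemset(d_a, 0, n * sizeof(float));  // all-zero bytes are 0.0f
    cudaMemset(d_b, 0, n * sizeof(float));

    pingPong<<<1, n>>>(d_a, d_b, steps);

    // With an odd number of steps the last write lands in bufB.
    cudaError_t err = cudaMemcpy(h_b, d_b, n * sizeof(float), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    for (int i = 0; i < n; i++) assert(h_b[i] == (float)steps);
    printf("per-thread pointer swap finished without any barrier\n");

    cudaFree(d_a); cudaFree(d_b);
    return 0;
}
```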

One situation where it is good to use the if (threadIdx.x == 0) construction is when you need to read a single value from global memory that the entire block will use. Having each thread perform that read will slow things down immensely.
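A small sketch of that case, assuming one scale factor per block stored in global memory. The kernel name scaleByBlockValue and the arrays scale and data are invented for the example. Thread 0 does the single global read, stores it in shared memory, and the barrier makes it visible to the rest of the block.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: one global read per block, shared by all its threads.
__global__ void scaleByBlockValue(const float *scale, float *data)
{
    __shared__ float s;              // one copy per block

    if (threadIdx.x == 0) {
        s = scale[blockIdx.x];       // single read from global memory
    }
    __syncthreads();                 // make s visible to the whole block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= s;
}

int main()
{
    const int threads = 64, blocks = 2, n = threads * blocks;
    float h_scale[blocks] = {2.0f, 3.0f};
    float h_data[n];
    for (int i = 0; i < n; i++) h_data[i] = 1.0f;

    float *d_scale, *d_data;
    cudaMalloc(&d_scale, blocks * sizeof(float));
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_scale, h_scale, blocks * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    scaleByBlockValue<<<blocks, threads>>>(d_scale, d_data);

    cudaError_t err = cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    for (int i = 0; i < n; i++) {
        assert(h_data[i] == h_scale[i / threads]);  // each block used its own factor
    }
    printf("each block scaled its data by the value thread 0 loaded\n");

    cudaFree(d_scale); cudaFree(d_data);
    return 0;
}
```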

Ah, I see. Up until now I thought the register memory space was pretty similar to the shared memory… Woops!

So I suppose that, unless I am worried about running out of register memory space, then it is faster, like you said, to have copies of variables like k and the pointers p1 and p2 in my example above for every thread, rather than putting them in shared memory.

Many thanks