branch and precision

I implemented a CUDA program and found that some things behave differently from what I expected.




The Programming Guide 2.0 says: “If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path.”

However, if I remove __syncthreads(), the result is wrong.

I use the float type in both the CUDA code and the CPU code, and generate a float array in each. I compared the two arrays and found that the largest error is 0.00001. Is that reasonable?

  1. Well, are you writing to a shared variable in the if? Without the __syncthreads(), the other warps in the block could (and will) run past the if without waiting for it to finish, and read that variable before it has been written.

  2. Differences of ~10^(-5) are completely normal with floating-point math. You can get the same differences in your CPU code just by changing the order of additions. Worse errors are of course possible if you aren’t careful (like subtracting two almost identical numbers), but these are issues for floating-point calculation in general and not specific to the GPU.

Yes, I write to a shared variable inside the if.

I think the other threads in the warp will wait for thread 0. However, because there is a write to a shared variable, __syncthreads() is needed to make the result visible to the other threads.

From the Guide:

“Only after the execution of a __syncthreads() (Section 4.4.2) are writes to shared variables guaranteed to be visible by other threads.”
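
For anyone reading later, the pattern discussed here (one thread writes a shared variable inside an if, then every thread in the block reads it) could look roughly like the sketch below. This is not the original poster's code: the names broadcast_kernel, s_value and broadcast_from_thread0 are made up for illustration, a single block is assumed, and error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: thread 0 writes a shared variable inside an if,
// then every thread of the block reads it back.
__global__ void broadcast_kernel(int value, int *out)
{
    __shared__ int s_value;

    if (threadIdx.x == 0) {
        s_value = value;  // only thread 0 takes this branch
    }

    // Barrier for the whole block: every thread waits here, and the write
    // above is guaranteed to be visible afterwards. Without it, warps that
    // do not contain thread 0 may read s_value before it has been written.
    __syncthreads();

    out[blockIdx.x * blockDim.x + threadIdx.x] = s_value;
}

// Hypothetical host helper: launches one block and copies the results back.
// Error checking is omitted for brevity.
std::vector<int> broadcast_from_thread0(int value, int threads)
{
    int *d_out = nullptr;
    cudaMalloc(&d_out, threads * sizeof(int));

    broadcast_kernel<<<1, threads>>>(value, d_out);

    std::vector<int> h_out(threads);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, threads * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

Calling broadcast_from_thread0(42, 128) launches four warps; with the barrier in place every element of the returned vector should equal 42. Removing the __syncthreads() turns the read of s_value into a race, so the result may or may not be correct from run to run.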

What do you mean by changing the order of additions?

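In floating-point arithmetic, addition is not associative: (a + b) + c is NOT necessarily equal to a + (b + c), because every intermediate result is rounded. (a + b does equal b + a; it is the grouping that matters.) So you will get different answers depending on the order of operations.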
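
A small host-side sketch of the effect, assuming IEEE 754 single precision with no extra intermediate precision (the usual case on x86-64 and on the GPU). The function names sum_left_to_right and sum_right_to_left are made up for illustration; the same three numbers are added in two different groupings.

```cpp
// Adds three floats as (a + b) + c.
float sum_left_to_right(float a, float b, float c)
{
    float t = a + b;  // rounded to float
    return t + c;
}

// Adds the same three floats as a + (b + c).
float sum_right_to_left(float a, float b, float c)
{
    float t = b + c;  // rounded to float; a small c can be lost here
    return a + t;
}
```

With a = 1e8f, b = -1e8f and c = 1.0f, the first grouping gives 1 and the second gives 0: near 1e8 adjacent floats are 8 apart, so -1e8f + 1.0f rounds back to -1e8f and the 1 is lost. A GPU reduction typically adds the elements in a different order than a sequential CPU loop, which is one way small differences such as 10^(-5) can arise between the two results.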

This is a good reference: