Hi, All –
I’ve read the relevant parts of the manual and done some searching, but I’m still confused about the exact cost of a warp serializing instructions (NOT memory reads).
When a warp (or half-warp on the 8000 series, right?) serializes an instruction because one or more threads take a branch that the other threads in the warp do not, does that mean the processor basically puts 31 threads on hold and executes one at a time until everything comes back together? Does that mean serialization is a 31x hit to execution time? When my kernels serialize because of flow control, it has never seemed that bad.
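To make the question concrete, here is a minimal sketch of the kind of branch I mean. Everything in it is made up for illustration: the names `divergent_kernel` and `run_divergent_demo`, the loop count, and the assumption that a warp is 32 threads. It is not my real kernel, just the smallest case where one thread per warp goes a different way from the other 31.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per (assumed 32-wide) warp takes a longer
// branch, while the other 31 threads take a short one.
__global__ void divergent_kernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = (float)i;
    if (threadIdx.x % 32 == 0) {
        // Long path: taken by 1 of every 32 threads.
        for (int k = 0; k < 100; ++k)
            v = v * 0.5f + 1.0f;
    } else {
        // Short path: taken by the remaining 31 threads.
        v += 1.0f;
    }
    out[i] = v;  // all threads execute this after the branch
}

// Hypothetical host helper: launches one block of n threads (n <= 512 to stay
// within older per-block limits) and copies the results into h_out.
// Returns 0 on success, nonzero on any CUDA error.
int run_divergent_demo(float *h_out, int n)
{
    float *d_out = 0;
    if (cudaMalloc((void **)&d_out, n * sizeof(float)) != cudaSuccess)
        return 1;

    divergent_kernel<<<1, n>>>(d_out, n);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(h_out, d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

Calling `run_divergent_demo(buf, 64)` from `main` with a 64-element `float` buffer runs two warps' worth of threads through the branch. What I am asking is how the hardware schedules the two paths inside each of those warps.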
Also, how does CUDA decide when to re-parallelize the execution? I’ve noticed that I don’t have to __syncthreads() after every branch to make things run fast, but how intelligent is nvcc in that regard?
Thanks for any insights,
Oregon State University Graphics Group