A question that gives me no rest:
CUDA provides the __usad(x, y, z) intrinsic, which computes |x - y| + z and,
according to decuda, compiles to a single instruction.
Does this essentially mean that it executes in 4 clock cycles, so that we basically get three arithmetic operations (subtract, absolute value, add) for the cost of one?
Or does it have a larger latency? Note that the CUDA programming guide does not make this clear.
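
One way to check this empirically would be to time a chain of dependent __usad calls in a single thread. The following is only a rough sketch, assuming a device that supports clock64() (compute capability 2.0 or later). The kernel name usad_latency, the chain length, and the input values are made up for illustration. The reported number also includes some loop and timer overhead, so it is an upper bound rather than an exact latency.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical microbenchmark: one thread runs a chain of __usad calls in
// which every call depends on the previous result, so the average cost per
// iteration approximates the instruction's latency (plus loop overhead).
__global__ void usad_latency(unsigned int a, unsigned int b,
                             unsigned int *out, long long *cycles)
{
    const int N = 1024;            // chain length, an arbitrary choice
    unsigned int acc = 0;

    long long start = clock64();
    #pragma unroll 16
    for (int i = 0; i < N; ++i) {
        // acc = |acc - a| + b; the next iteration must wait for this result
        acc = __usad(acc, a, b);
    }
    long long stop = clock64();

    *out = acc;                    // keep the result live so the loop is not optimized away
    *cycles = (stop - start) / N;  // average cycles per dependent __usad
}

int main()
{
    unsigned int *d_out = nullptr;
    long long *d_cycles = nullptr;
    cudaMalloc(&d_out, sizeof(unsigned int));
    cudaMalloc(&d_cycles, sizeof(long long));

    // a and b are passed at run time so the compiler cannot fold the chain
    usad_latency<<<1, 1>>>(7u, 3u, d_out, d_cycles);

    unsigned int h_out = 0;
    long long h_cycles = 0;
    cudaMemcpy(&h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_cycles, d_cycles, sizeof(h_cycles), cudaMemcpyDeviceToHost);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("result = %u, approx. cycles per dependent __usad = %lld\n",
           h_out, h_cycles);

    cudaFree(d_out);
    cudaFree(d_cycles);
    return 0;
}
```

Because each call waits on the previous one, this measures latency, not throughput. Running the same loop with a plain addition in place of __usad would give a baseline to compare against.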