Then use atan2() to map back to the full circle. Not sure whether that gives you the smaller or larger of two possible angles, and whether it will be properly signed (if that is required), so additional minor corrections may be necessary.

The magnitude computations would want to use rhypot() is you need robustness, otherwise direct use of rsqrt() will give you the best performance:

Sorry, I wasn’t thinking when I recommended rhypot(). That works great for 2D, but for 3D you need something like rhypot (a, hypot(b, c)) which is robust but could be too slow.

My main focus was to get you down from two transcendental function calls to one. I know that for double precision, acos() and atan2() have roughly the same throughput on current Teslas, so that seemed like a good tradeoff.

So computing sine and cosine first gives us a unit vector, and using the atan2() of that should give us the angle (possibly modulo minor corrections as mentioned). If you would not need the sine and cosine, but just the angle itself my approach could be even more efficient since we do not need the two divisions by |a|*|b| in that case (as they cancel) leaving us with:

theta = atan2 (norm(cross(a,b)), dot(a,b))

An interesting side-problem here is how to compute norm(cross(a,b)) with the minimal number of operations, by making maximal use of FMA. Is there a better way than the naïve method, while being numerically stable? I have not looked into that yet.

Well, that is the naïve implementation I referenced. The question is, can one do better than this?

In terms of accuracy and robustness use of the functions hypot() and rhypot() will help. As for the cross product, it is made up of terms of the form ab - cd which can suffer from cancellation even when FMA is used in the obvious way. A more accurate evaluation using FMAs (at the cost of doubling the number of required operations) is mentioned in this recent paper: