__umul64hi timing

How fast is the __umul64hi intrinsic on the GPU? I am using it to divide integers, and it takes an extremely long time in some cases and much less in others. When I try to build a very simplified test case to isolate this behavior, it goes away. Has anyone else had trouble with this intrinsic?
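
For anyone unfamiliar with the technique, here is a minimal sketch of one common way __umul64hi is used for integer division: multiply by a precomputed reciprocal, take the high 64 bits, and apply one correction step. This is an illustration under assumptions, not necessarily the code in question. The names mulhi64, make_reciprocal and div_by_invariant are hypothetical, and the host fallback assumes a GCC or Clang host compiler that provides unsigned __int128.

```cpp
#include <cstdint>

// Allow the same source to build as plain C++ or as CUDA.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// High 64 bits of the 128-bit product a * b.
// On the device this maps to the __umul64hi intrinsic; on the host it
// falls back to unsigned __int128 (assumed available: GCC/Clang extension).
HD inline uint64_t mulhi64(uint64_t a, uint64_t b) {
#ifdef __CUDA_ARCH__
    return __umul64hi(a, b);
#else
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
#endif
}

// Hypothetical helper: approximate reciprocal floor((2^64 - 1) / d), d != 0.
// Computed once per divisor, so the slow division is paid only once.
HD inline uint64_t make_reciprocal(uint64_t d) {
    return UINT64_MAX / d;
}

// Hypothetical helper: n / d using the precomputed reciprocal m.
// The estimate never exceeds the true quotient and is low by at most 1,
// so a single correction step is enough.
HD inline uint64_t div_by_invariant(uint64_t n, uint64_t d, uint64_t m) {
    uint64_t q = mulhi64(n, m);   // quotient estimate
    uint64_t r = n - q * d;       // remainder for that estimate
    if (r >= d) {
        ++q;                      // estimate was one too small
    }
    return q;
}
```

In a kernel this would be called as div_by_invariant(n, d, make_reciprocal(d)), with the reciprocal hoisted out of any loop. Note that the correction branch is data dependent, so threads in the same warp can take different paths.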