I was wondering whether anyone has successfully implemented emulated double-precision division on a CUDA device. I tried to follow the approach used in dsfun90, but my device results still fail to match the results computed in true double precision. I have also tested the algorithm on the CPU, both in device emulation mode and as plain CPU code, compiled with -ffloat-store in each case. When I do this, the answers come out exactly correct. Any thoughts are welcome.
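For readers unfamiliar with the technique, here is a minimal C sketch of double-single ("float-float") division in the style of dsfun90's divide routine. It is not the poster's code. The type `dsfloat`, the helpers `ds_from_double`/`ds_to_double`/`ds_div` and the macro `DS_SPLIT` are hypothetical names chosen for illustration. The key assumption is that every float operation is rounded to single precision, which is what -ffloat-store enforces on x87.

```c
#include <assert.h>

/* Hypothetical double-single type: value = hi + lo, with |lo| <= ulp(hi)/2. */
typedef struct { float hi, lo; } dsfloat;

/* Dekker split constant for a 24-bit float significand: 2^12 + 1. */
#define DS_SPLIT 4097.0f

/* Hypothetical helper: round a double to the nearest double-single pair. */
static dsfloat ds_from_double(double x)
{
    dsfloat r;
    r.hi = (float)x;
    r.lo = (float)(x - (double)r.hi);
    return r;
}

/* Hypothetical helper: widen a double-single pair back to double. */
static double ds_to_double(dsfloat x)
{
    return (double)x.hi + (double)x.lo;
}

/* c = a / b in double-single arithmetic, following the structure of the
 * dsfun90 divide routine. ASSUMES every float operation is rounded to
 * single precision: no x87 extended-precision intermediates (-ffloat-store)
 * and no contraction of a*b+c into a fused multiply-add (GCC:
 * -ffp-contract=off); otherwise the exact-error steps below break. */
static dsfloat ds_div(dsfloat a, dsfloat b)
{
    float s1, cona, conb, a1, a2, b1, b2, c11, c21, c2, e;
    float t1, t2, t11, t12, t21, t22, s2;
    dsfloat c;

    assert(b.hi != 0.0f);

    /* First approximation of the quotient from the high words. */
    s1 = a.hi / b.hi;

    /* Split s1 and b.hi into high and low halves (Dekker). */
    cona = s1 * DS_SPLIT;
    conb = b.hi * DS_SPLIT;
    a1 = cona - (cona - s1);
    b1 = conb - (conb - b.hi);
    a2 = s1 - a1;
    b2 = b.hi - b1;

    /* Exact product: s1 * b.hi == c11 + c21. */
    c11 = s1 * b.hi;
    c21 = (((a1 * b1 - c11) + a1 * b2) + a2 * b1) + a2 * b2;

    /* s1 * b.lo; only the high-order word is needed. */
    c2 = s1 * b.lo;

    /* (c11, c21) + c2 via Knuth's two-sum, then renormalize to (t12, t22). */
    t1 = c11 + c2;
    e = t1 - c11;
    t2 = ((c2 - e) + (c11 - (t1 - e))) + c21;
    t12 = t1 + t2;
    t22 = t2 - (t12 - t1);

    /* Remainder a - (t12, t22), again using two-sum on the high words. */
    t11 = a.hi - t12;
    e = t11 - a.hi;
    t21 = ((-t12 - e) + (a.hi - (t11 - e))) + a.lo - t22;

    /* Correction term from the remainder, then renormalize s1 + s2. */
    s2 = (t11 + t21) / b.hi;
    c.hi = s1 + s2;
    c.lo = s2 - (c.hi - s1);
    return c;
}
```

The split step `cona - (cona - s1)` is the most fragile line. If a compiler fuses `s1 * DS_SPLIT - s1` into a single multiply-add, the halves are no longer exact and the low word of the result is silently wrong, even though the high word still looks plausible.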