I would recommend reading the following whitepaper, if you haven’t had the chance to do so:
Without knowing anything about the code other than that it uses double precision, I would guess that the most likely cause of numerical discrepancies between CPU and GPU is the compiler merging a double-precision multiplication and a dependent addition into a single double-precision FMA (fused multiply-add). You can turn that off by passing -fmad=false to nvcc, but this will likely reduce both the accuracy and the performance of the GPU code. FMA typically improves accuracy because the result is rounded only once, after the addition, instead of once for the product and again for the sum. This also provides some protection from subtractive cancellation.
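To make the effect of FMA concrete, here is a minimal sketch that evaluates x*x - c on the device both ways. The kernel name fma_demo and the input values are made up for illustration. It relies on the documented behavior that the __dmul_rn and __dadd_rn intrinsics are never merged into an FMA, so they can stand in for the unfused CPU-style computation without changing compiler flags. The inputs are chosen so that c equals x*x rounded to double. The unfused version should therefore cancel to exactly zero, while the fused version should recover the 2^-60 term that was rounded away.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Illustrative kernel (hypothetical name): computes x*x - c two ways.
__global__ void fma_demo(double x, double c, double *out)
{
    // Separate multiply and add: the product is rounded to double
    // before the subtraction. These intrinsics are never contracted
    // into an FMA by the compiler.
    out[0] = __dadd_rn(__dmul_rn(x, x), -c);

    // Fused multiply-add: the exact product feeds the addition and
    // the result is rounded only once.
    out[1] = __fma_rn(x, x, -c);
}

int main()
{
    // Assumed test values: x = 1 + 2^-30, so x*x = 1 + 2^-29 + 2^-60 exactly,
    // and c = 1 + 2^-29 is x*x rounded to double precision.
    const double x = 1.0 + ldexp(1.0, -30);
    const double c = 1.0 + ldexp(1.0, -29);

    double *d_out = NULL;
    double h_out[2] = {0.0, 0.0};

    // Error checking omitted here to keep the sketch short.
    cudaMalloc(&d_out, sizeof(h_out));
    fma_demo<<<1, 1>>>(x, c, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    printf("separate multiply and add: %.17g\n", h_out[0]);
    printf("fused multiply-add:        %.17g\n", h_out[1]);
    return 0;
}
```

If only a few expressions need to match the CPU bit for bit, wrapping them in these intrinsics may be a more targeted alternative to disabling FMA contraction for the whole program with -fmad=false.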
I am not exactly sure what you mean by “matching to within one digit after the decimal point”. Can you show an example pair of results? How many digits are there altogether, and how many match?
Depending on how big the numerical differences are, they could also be due to a bug in the code. In addition to carefully reviewing the code, make sure that it checks the status of all CUDA API calls and kernel launches, and run the program under cuda-memcheck. Please be aware that on an sm_13 device cuda-memcheck will be able to provide only very limited checking due to hardware limitations.
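As a rough sketch of what checking every API call and kernel launch can look like: the macro name CUDA_CHECK and the kernel scale below are hypothetical placeholders, not taken from your code. The runtime calls used (cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorString) are standard CUDA runtime API functions. On very old toolkits you may need cudaThreadSynchronize in place of cudaDeviceSynchronize.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: report file and line, then abort,
// whenever a CUDA runtime call returns something other than cudaSuccess.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel standing in for the real computation.
__global__ void scale(double *data, double factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

int main()
{
    const int n = 1024;
    double *d_data = NULL;

    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(double)));
    CUDA_CHECK(cudaMemset(d_data, 0, n * sizeof(double)));

    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0, n);
    CUDA_CHECK(cudaGetLastError());       // catches launch and configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised while the kernel ran

    CUDA_CHECK(cudaFree(d_data));
    return 0;
}
```

A kernel launch returns no status itself, which is why the launch is followed by the two separate checks. Once the program runs cleanly with these checks in place, running the same binary under cuda-memcheck is the next step.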