Why does emurelease mode's result differ from the CPU version?

I have a problem: even when I launch only one block per grid and one thread per block, the result of emurelease mode still differs from the CPU version of the program, and I am sure that I have ported the algorithm to the GPU correctly. I examined the results carefully and found that they are approximately the same but not exactly the same, which is the kind of difference I would have expected between release mode and the CPU version. Given that emu mode actually runs on the CPU, I simply want to know why the results are different.