Well, I don’t know what the internals of the CUDA emulator look like, but you should know that there are very often going to be differences between code that does serial computations and code that does parallel computations. Floating-point addition isn’t associative, so performing the same operations in a different order can round differently. And it’s not that one is correct and the other isn’t. Technically, neither of them is “correct”, because they both introduce floating-point rounding errors into the calculations. In some cases, parallel code can even be more accurate than standard CPU code: a tree-style reduction, for example, tends to accumulate less rounding error than one long running sum.
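To make the ordering point concrete, here is a minimal sketch in plain Python (double precision, nothing CUDA-specific). The names `serial_sum` and `tree_sum` are made up for illustration; `tree_sum` only mimics the pairwise order a parallel reduction might use, and is not a claim about how the emulator or the GPU actually schedules the additions.

```python
import math

def serial_sum(values):
    """Add left to right, the way a simple CPU loop would."""
    total = 0.0
    for v in values:
        total += v
    return total

def tree_sum(values):
    """Add pairwise, roughly the order a parallel reduction uses (illustrative only)."""
    if len(values) == 0:
        return 0.0
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return tree_sum(values[:mid]) + tree_sum(values[mid:])

small = [0.1, 0.2, 0.3]
print(serial_sum(small))  # (0.1 + 0.2) + 0.3
print(tree_sum(small))    # 0.1 + (0.2 + 0.3)

# Compare both orders against a correctly rounded reference sum.
values = [0.1 * (i % 7 + 1) for i in range(1000)]
reference = math.fsum(values)
print(serial_sum(values) - reference)
print(tree_sum(values) - reference)
```

For the three-element list the two orders disagree in the last digit, even though both are legitimate ways to add the same numbers.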
tl;dr: If there’s only a small difference between the Debug and EmuDebug results, I wouldn’t worry too much about it. The definition of “small” will depend on your application, so compare the results against a tolerance rather than expecting them to match bit for bit.
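If you want to check this automatically, one option is a tolerance-based comparison. This is only a sketch: `results_match`, the sample values, and the default tolerances are all placeholders I made up, and you would pick tolerances that make sense for your application.

```python
import math

def results_match(debug_values, emu_values, rel_tol=1e-5, abs_tol=1e-8):
    """Return True if every pair of results agrees within the given tolerances."""
    if len(debug_values) != len(emu_values):
        return False
    return all(
        math.isclose(d, e, rel_tol=rel_tol, abs_tol=abs_tol)
        for d, e in zip(debug_values, emu_values)
    )

# Hypothetical outputs from the two builds (not real measurements).
debug_values = [1.0000001, 2.5, 1e-9]
emu_values = [1.0000002, 2.5, 0.0]

print(results_match(debug_values, emu_values))                            # loose tolerance
print(results_match(debug_values, emu_values, rel_tol=1e-9, abs_tol=0.0))  # strict tolerance
```

The `abs_tol` argument matters when the values are near zero, where a purely relative comparison would reject differences that are harmless in practice.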