My CUDA kernels, which compute double-precision values, produce results consistent with the reference values when I run them from Visual Studio (debug and release), standalone, or under Nsight Compute. However, when I profile with Nsight Systems, I get a large number of mismatches:
...
at 3357224 -3.14949 should have been -2.67323
at 3357225 -3.14752 should have been -2.67126
at 3357226 -3.14776 should have been -2.6715
at 3357227 -3.14772 should have been -2.67146
at 3357228 -3.14823 should have been -2.67197
at 3357229 -3.14669 should have been -2.67042
at 3357230 -3.14863 should have been -2.67237
at 3357231 -3.14981 should have been -2.67354
at 3357232 -3.15097 should have been -2.6747
at 3357233 -3.15086 should have been -2.67459
at 3357234 -3.14987 should have been -2.6736
at 3357235 -3.14975 should have been -2.67349
at 3357236 -3.15079 should have been -2.67453
at 3357237 -3.15018 should have been -2.6739
...
I only observe this in kernels that use shuffle instructions; the kernel that does not use shuffle is not affected.