Incorrect "double" calculations when profiling with Nsight Systems

My CUDA kernels, which calculate double values, produce results consistent with the reference values when I run them under Visual Studio (debug and release), standalone, or with Nsight Compute. However, when I profile with Nsight Systems, I get a ton of errors:

...
at 3357224 -3.14949 should have been -2.67323
at 3357225 -3.14752 should have been -2.67126
at 3357226 -3.14776 should have been -2.6715
at 3357227 -3.14772 should have been -2.67146
at 3357228 -3.14823 should have been -2.67197
at 3357229 -3.14669 should have been -2.67042
at 3357230 -3.14863 should have been -2.67237
at 3357231 -3.14981 should have been -2.67354
at 3357232 -3.15097 should have been -2.6747
at 3357233 -3.15086 should have been -2.67459
at 3357234 -3.14987 should have been -2.6736
at 3357235 -3.14975 should have been -2.67349
at 3357236 -3.15079 should have been -2.67453
at 3357237 -3.15018 should have been -2.6739
...

I observe this when using shuffle instructions; the kernel that does not use shuffles does not show the problem.

@mjain could this be a CUPTI issue?

Please provide a minimal reproducible example, driver version, GPU model, and tool versions.

It turned out there was a subtle race condition, and it manifested only when running under Nsight Systems, not under Nsight Compute or standalone. Why would that be?

Please provide sufficient information if you would like help: describe the race condition in detail, and provide the requested information along with a minimal reproducible example.
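
For readers who land here with similar symptoms: the actual race was never posted in this thread, so the following is only a hypothetical sketch of one common way a shuffle-based kernel can race. It assumes a block-wide sum where each warp reduces with `__shfl_down_sync` and then hands its partial result to warp 0 through shared memory. The names `warp_sum`, `block_sum`, and `run_block_sum` are made up for illustration. If the marked `__syncthreads()` is left out, warp 0 may read `partial[]` before the other warps have written it. Because a race like this depends on timing, a tool that perturbs scheduling can plausibly expose it while other runs happen to pass.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Sum a value across one warp with shuffles; lane 0 ends up with the total.
__inline__ __device__ double warp_sum(double v) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

// Hypothetical kernel: a single block sums n doubles into *out.
// Assumes blockDim.x is a multiple of 32 and at most 1024.
__global__ void block_sum(const double* in, double* out, int n) {
    __shared__ double partial[32];           // one slot per warp
    const int lane = threadIdx.x % warpSize;
    const int warp = threadIdx.x / warpSize;

    // Each thread accumulates a strided slice of the input.
    double v = 0.0;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        v += in[i];

    v = warp_sum(v);
    if (lane == 0)
        partial[warp] = v;

    // Omitting this barrier is the kind of race sketched here:
    // warp 0 could read partial[] before the other warps have written it.
    __syncthreads();

    if (warp == 0) {
        const int num_warps = blockDim.x / warpSize;
        v = (lane < num_warps) ? partial[lane] : 0.0;
        v = warp_sum(v);
        if (lane == 0)
            *out = v;
    }
}

// Host helper: fills the input with ones, so the sum should equal n.
double run_block_sum(int n) {
    std::vector<double> host(n, 1.0);
    double *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(double));
    cudaMalloc(&d_out, sizeof(double));
    cudaMemcpy(d_in, host.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    block_sum<<<1, 256>>>(d_in, d_out, n);

    double result = 0.0;
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&result, d_out, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}

int main() {
    std::printf("%f\n", run_block_sum(1000));
    return 0;
}
```

Shared-memory hazards of this kind can usually be checked with the racecheck tool in NVIDIA's compute-sanitizer, for example `compute-sanitizer --tool racecheck` followed by the application executable. Whether it would have flagged the race in this thread is unknown.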