I believe I may have encountered a bug in the CUDA OpenCL stack. A trivial two-line program generates different outputs when run on 4 different devices. I have included a self-contained script to reproduce this here:
Apologies, I’m not sure where to post this. Please advise if there is somewhere more suitable.