Why did you swap the roles of x and y for your row and col variables between the shader and CUDA versions? That kind of thing matters for performance, if nothing else.
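In CUDA, for performance, we usually want to derive row indices from the y grid variables (threadIdx.y, blockIdx.y) and column indices from the x grid variables (threadIdx.x, blockIdx.x), so that adjacent threads in a warp access adjacent elements of a row-major array. I wouldn't be surprised if a shader has a similar sensitivity.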
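To make the convention concrete, here is a minimal sketch, not your actual kernel. It assumes a row-major float array and a trivial "multiply by 2" operation; the names scale_rows_y_cols_x and count_mismatches and the 32x8 block shape are made up for illustration. The point is only the index mapping: x gives the column, y gives the row, so consecutive threads in a warp (consecutive threadIdx.x) read and write consecutive addresses.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// x -> column (fastest-varying index in row-major storage), y -> row.
// Threads with consecutive threadIdx.x then touch consecutive addresses.
__global__ void scale_rows_y_cols_x(const float* in, float* out, int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols) {
        out[row * cols + col] = 2.0f * in[row * cols + col];
    }
}

// Hypothetical host helper (illustration only): runs the kernel once and
// returns the number of wrong output elements, or -1 on a CUDA error.
int count_mismatches(int rows, int cols)
{
    const size_t n = static_cast<size_t>(rows) * cols;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (size_t i = 0; i < n; ++i) h_in[i] = static_cast<float>(i % 1000);

    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }

    cudaError_t err = cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        // Assumed block shape: 32 threads along x so one warp spans one row segment.
        dim3 block(32, 8);
        dim3 grid((cols + block.x - 1) / block.x,   // x covers columns
                  (rows + block.y - 1) / block.y);  // y covers rows
        scale_rows_y_cols_x<<<grid, block>>>(d_in, d_out, rows, cols);
        err = cudaGetLastError();
        if (err == cudaSuccess) err = cudaDeviceSynchronize();
    }
    if (err == cudaSuccess) {
        err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (size_t i = 0; i < n; ++i) {
        if (h_out[i] != 2.0f * h_in[i]) ++bad;
    }
    return bad;
}
```

If you swap the two assignments (row from x, col from y) the result is still correct, but adjacent threads in a warp end up cols elements apart in memory, which typically hurts global memory throughput. Note that the grid dimensions have to follow the same convention: grid.x is sized from cols and grid.y from rows.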