So I got my code, theoretically, up and running, but I just ran into some odd behavior:
- the code gives me the correct result when I compile it with -G (-g is usually there too, but that's unimportant), but when I omit -G, the answer is around a factor of 2x as large as it should be. Has anyone experienced this? What is it about adding -G that changes the answer?
I tried adding in some __syncthreads() and cudaThreadSynchronize() calls, thinking -G was doing that for me, but that didn't change a thing.
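For illustration, this is the kind of pattern I mean (a toy kernel, not my actual code): a shared-memory block reduction where removing the barriers makes the result racy. A debug (-G) build may still happen to give the right answer because it keeps threads more in lockstep, while an optimized build reorders and caches accesses and does not:

```cuda
// Toy sketch (hypothetical, not my actual code): sum 256 doubles per block.
// With the __syncthreads() calls removed, the reduction races on shared
// memory; the bug can be masked in a -G build and exposed when optimized.
__global__ void blockSum(const double *in, double *out)
{
    __shared__ double s[256];
    int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();               // all loads must land before reducing

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();           // removing this introduces the race
    }
    if (tid == 0)
        out[blockIdx.x] = s[0];    // one partial sum per block
}
```

In my case, though, adding the barriers made no difference, so I don't think this is it.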
And while I have your attention, any thoughts on this: the code is written entirely using double instead of float. To get it to run on sm 1.2 or lower architectures, I figured nvcc's automatic demotion of double to float would be just fine. Turns out it is not. The code only actually runs correctly when I compile with -arch=sm_13 and run it on an sm 1.3 card (didn't try a 2.0).
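Roughly what I'm compiling with (file names hypothetical):

```shell
# Works: sm 1.3 has native double-precision support
nvcc -arch=sm_13 -o mycode mycode.cu

# Wrong results: without -arch, nvcc targets a pre-sm_13 virtual arch,
# warns "Double is not supported", and demotes double to float
nvcc -o mycode mycode.cu
```

So my guess is the demotion to float is losing precision somewhere my algorithm can't tolerate, but I'd have expected a small error, not a wrong answer.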