I assume there might be a problem with the way I am using the shared memory, but I really can’t see what is wrong, especially because the CUDA environment does NOT return any runtime error.
I compiled my program with nvcc -m 64 -arch compute_13 -code sm_13 -o post post.cu. My machine runs Fedora 10 (x86_64) with gcc 4.3.2 on an Intel i7 (2.67GHz) with 6GB of RAM and two GPUs: a GeForce GTX280, which is running the display, and a Tesla C1060. I am using the 2.2 toolkit with the 185.18.08 driver from the NVIDIA website. I attached the complete program file for reference.
I would greatly appreciate any hint. Thanks for your time! post.zip (1.28 KB)
I ran my application on devices in physically different machines, and it always failed the same way. So, I decided to pick up cuda-gdb and try debugging my kernel. I manually stepped through each line, and when I executed line 69 the terminal running cuda-gdb hung beyond any hope of recovery (Ctrl-C wouldn't work).
Am I using any software / hardware configuration (described in my previous post in the thread) known to be unstable? I tried many different CUDA applications, and every one that involves shared memory fails on me. So, I am wondering if there could be a driver configuration error on all my machines (all Fedora 10, driver version 185.18.08) …
Thanks for answering, I really appreciate your attention to this matter.
If you comment out line 99 (which is useless) and run the program under valgrind / memcheck, no memory errors are found.
cuda-gdb (program compiled WITHOUT -deviceemu) shows that the arrays are correctly initialized and accessible before line 69; is that the behavior we would expect if the pointers pointed to nonsense areas? I tried substituting lines 69 and 70 with constant-value assignments (as in lfx[threadIdx.x] = 6;). Right after executing line 69, the debugger command print lfx[0] (for thread 0) reports a value of 5.3671573228516165e-315. If I change all the declarations (both device and host) from double to float, print lfx[0] shows the correct value of 6, but the application still fails to perform operations (sums) on the shared memory.
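In case it helps others reproduce the problem, here is a minimal, hypothetical sketch of the pattern that fails for me (the array name lfx, the block size, and the structure are assumptions standing in for my actual post.cu, which I can't reduce further without losing the bug): a constant assigned to a __shared__ double array and copied back out.

```cuda
#include <cstdio>

// Hypothetical reduced example, NOT the actual post.cu code.
// Each thread writes a constant into shared memory and reads it back.
__global__ void sharedDoubleTest(double *out)
{
    __shared__ double lfx[64];

    lfx[threadIdx.x] = 6.0;   // the constant-assignment test described above
    __syncthreads();

    out[threadIdx.x] = lfx[threadIdx.x];
}

int main()
{
    const int n = 64;
    double *d_out, h_out[n];

    cudaMalloc((void**)&d_out, n * sizeof(double));
    sharedDoubleTest<<<1, n>>>(d_out);
    cudaMemcpy(h_out, d_out, n * sizeof(double), cudaMemcpyDeviceToHost);

    // Check for errors explicitly, since the failure is otherwise silent
    cudaError_t err = cudaGetLastError();
    printf("h_out[0] = %f, status: %s\n", h_out[0], cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}
```

One thing worth double-checking when building a repro like this: if nvcc is invoked without -arch sm_13 (or -arch compute_13), it demotes double to float, printing only a warning, and that produces exactly the kind of garbage values described above. My command line does pass compute_13, so that alone shouldn't explain it, but it's an easy flag to lose in a Makefile.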
I am really clueless about what is going on. The SDK examples run without reporting any errors, so apparently only my programs are faulty, but even the simplest program behaves incorrectly… Is there any way to check whether the toolkit / driver install is sane? (The SDK examples compiled and ran fine, which is why I am puzzled.)
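For the install sanity check, the simplest thing I can think of (beyond running deviceQuery from the SDK) is a small host-only program that queries the runtime directly and confirms the device actually reports compute capability 1.3 or higher, i.e. double-precision support. This is just a sketch of that idea:

```cuda
#include <cstdio>

// Sanity-check the runtime/driver install: enumerate devices and confirm
// the compute capability supports double precision (>= 1.3).
int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, compute %d.%d, doubles: %s\n",
               i, prop.name, prop.major, prop.minor,
               (prop.major > 1 || prop.minor >= 3) ? "yes" : "no");
    }
    return 0;
}
```

If cudaGetDeviceCount itself errors out, or the GTX 280 / Tesla C1060 don't show up as compute 1.3, that would point at a driver problem rather than at my kernels.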