Better understanding of hardware issues

I am developing CUDA code for a statistical algorithm on a GTX 580 under Linux. Based on some seemingly random errors in my code, I suspect that my system might not be set up properly. I’m a software programmer and have stayed away from hardware-related issues so far, but it appears that some understanding of hardware is necessary in the GPU world, even for a scientific computing programmer.

What is the best way for someone like me to self-teach the necessary hardware aspects of my environment, so that I can be self-sufficient in diagnosing and solving basic hardware-related problems? I’m even willing to consider a training course, online or otherwise.

Thank you,
Alireza Mahani

I would investigate a race condition or memory error in your code first. Nearly all of my CUDA problems turn out to be software-related. As a baseline, I always run cuda-memcheck before investigating other things.
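
A minimal sketch of that baseline follows. It assumes a toy kernel rather than your actual algorithm. The names CUDA_CHECK, scale and run_demo are made up for illustration. Only the cuda* runtime calls are real API. The idea is to check the return value of every runtime call, and to check both the launch and the synchronization after each kernel, so that a software fault is reported where it happens instead of surfacing later as a "random" error.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of CUDA): print the error and its
// source location, then bail out of the enclosing function with 1.
// For brevity this sketch does not free device memory on the error path.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                         __FILE__, __LINE__, cudaGetErrorString(err_));    \
            return 1;                                                      \
        }                                                                  \
    } while (0)

// Illustrative kernel: scales n floats in place. The bounds guard matters
// because the grid is rounded up and may hold more threads than elements.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Returns 0 if every CUDA call succeeded and the result is as expected.
int run_demo()
{
    const int n = 1000;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float *dev = NULL;
    CUDA_CHECK(cudaMalloc(&dev, n * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // round up
    scale<<<blocks, threads>>>(dev, n, 2.0f);
    CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dev));

    // Verify on the host that every element was doubled.
    for (int i = 0; i < n; ++i) {
        if (host[i] != 2.0f) return 1;
    }
    return 0;
}

int main()
{
    return run_demo();
}
```

Assuming the file is saved as demo.cu (a made-up name), something like "nvcc -g -G demo.cu -o demo" followed by "cuda-memcheck ./demo" should run it under the memory checker. Building with device debug information lets cuda-memcheck point at source lines. If the checks and cuda-memcheck both come back clean on your real code, hardware becomes a more plausible suspect.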

(For example: I recently went hunting for the cause of what appeared to be a performance-related bug. At random, the computer would get into a state where my kernel took 10x longer than normal to complete. I started swapping out cards, moving slots, and chasing every spurious correlation between this bug and the hardware configuration. Finally, I ran cuda-memcheck and discovered that I was accessing unallocated memory while traversing a tree structure. It didn’t affect any of my test cases, but it sent the kernel on a random walk through memory before it hit the termination condition. All of my hardware-related hypotheses were wrong.)