Better understanding of hardware issues

asmahani · August 1, 2011, 5:20pm

I am developing CUDA code for a statistical algorithm on a GTX 580 under Linux. Based on some seemingly random errors in my code, I suspect that my system might not be set up properly. I’m a software programmer and have stayed away from hardware-related issues, but it appears that some understanding of hardware is necessary in the GPU world, even for a scientific computing programmer.

What is the best way for someone like me to self-teach the necessary hardware aspects of my environment, so that I can be self-sufficient in diagnosing and solving basic hardware-related problems? I’m even willing to consider a training course, online or otherwise.

Thank you,
Alireza Mahani

seibert · August 3, 2011, 2:59pm

I would investigate a race condition or memory error in your code first. Nearly all of my CUDA problems turn out to be software-related. As a baseline, I always run cuda-memcheck before investigating other things.

(For example: I recently went hunting for the cause of what appeared to be a performance-related bug. At random, the computer would get into a state where my kernel took 10x longer than normal to complete. I started swapping out cards, moving slots, and chasing every spurious correlation between this bug and the hardware configuration. Finally, I ran cuda-memcheck and discovered that I was accessing unallocated memory while traversing a tree structure. It didn’t affect any of my test cases, but it sent the kernel on a random walk through memory before it hit the termination condition. All of my hardware-related hypotheses were wrong.)
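To make that class of bug concrete, here is a minimal host-side C++ sketch of a tree walk that validates every child index before following it. It is not the code from the anecdote: the `Node` layout (children stored as indices into one flat array, with -1 meaning "no child"), the `walk` function, and the step limit are all assumptions made for illustration. The same checks could in principle go into a device function.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical node layout: children are indices into one flat array,
// and -1 means "no child in this direction".
struct Node {
    int left;
    int right;
    float split;  // go left when the key is below this value
};

enum class WalkStatus { ReachedLeaf, BadIndex, StepLimit };

struct WalkResult {
    WalkStatus status;
    int node;   // last valid node visited, or -1 if none
    int steps;  // number of nodes visited
};

// Walk from node 0 towards a leaf. Every index is range-checked before it is
// dereferenced, and the loop is capped so a corrupted link cannot turn the
// traversal into a long random walk.
WalkResult walk(const std::vector<Node>& nodes, float key, int max_steps) {
    int current = 0;
    int last = -1;
    for (int steps = 0; steps < max_steps; ++steps) {
        if (current < 0 || static_cast<std::size_t>(current) >= nodes.size()) {
            return {WalkStatus::BadIndex, last, steps};
        }
        const Node& n = nodes[current];
        last = current;
        const int next = (key < n.split) ? n.left : n.right;
        if (next == -1) {
            return {WalkStatus::ReachedLeaf, current, steps + 1};
        }
        current = next;
    }
    return {WalkStatus::StepLimit, last, max_steps};
}
```

With a check like this, a bad link is reported as `BadIndex` or `StepLimit` at the node where it occurs, instead of showing up later as an unexplained slowdown.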
