Recently I assembled a server with four GTX 480 cards to run CUDA (without SLI). The motherboard selected one of the GTX 480 cards as the primary VGA output. With this arrangement the NVIDIA X server ran on that card, and I could read each card's temperature from the NVIDIA X Server Settings window. I then ran CUDA on all four GPUs as an ordinary user on Linux.
Then something strange happened. The next day, after I logged into Linux through the direct VGA output, the GPU connected to the VGA display (GPU 0, as enumerated by the NVIDIA X server) stopped executing CUDA. The CPU controlling this GPU kept working. It seems that my interacting with that GPU through the display stopped its CUDA execution.
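To see exactly which card has dropped out instead of discovering it indirectly, a small probe program can walk every device and attempt a trivial allocation on each. This is only a sketch I put together for illustration (the allocation size and the use of the CUDA 3.2-era `cudaThreadExit()` call are my assumptions, not part of my original setup):

```cuda
/* Hypothetical probe: enumerate all CUDA devices and try a small
 * allocation on each, so a GPU that has stopped accepting CUDA work
 * (like GPU 0 here) is reported explicitly instead of failing silently. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        cudaSetDevice(dev);

        void *p = NULL;
        err = cudaMalloc(&p, 1 << 20);      /* try a 1 MB allocation */
        if (err == cudaSuccess) {
            printf("GPU %d (%s): OK\n", dev, prop.name);
            cudaFree(p);
        } else {
            printf("GPU %d (%s): FAILED (%s)\n", dev, prop.name,
                   cudaGetErrorString(err));
        }
        cudaThreadExit();   /* tear down this thread's context (CUDA 3.2 API) */
    }
    return 0;
}
```

Compiled with `nvcc`, this should print one OK/FAILED line per card, which makes it easy to confirm whether only the display-attached GPU is affected.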
In fact, I could force the motherboard to output video through the onboard VGA port, and I expect the GPUs could then run CUDA without any problem. But this is not my preferred approach: the onboard display chip is not an NVIDIA chip, so relying on it means I cannot start the NVIDIA X server, and then I cannot read each card's temperature, a crucial indicator of whether my system has sufficient ventilation.
So, is there any way to keep the GPU from stopping like this while still letting me check the card temperatures? Direct control without a VNC server is important to me, so that I can monitor the system conveniently.
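One possibility I have been considering is the command-line nvidia-smi tool that ships with the driver, which can report temperatures from a text console without any X server at all. A sketch of how that might look (the exact flag depends on the driver version, so treat these as assumptions to verify against the installed driver's man page):

```shell
# Sketch only; flag names vary with driver generation.
# On driver releases from the GTX 480 era, "-a" dumps the full state
# of every GPU, including the core temperature:
nvidia-smi -a

# Later drivers use "-q" for the same full query; to repeat it every
# 10 seconds for continuous monitoring from the onboard VGA console:
watch -n 10 nvidia-smi -q
```

If this works without the NVIDIA X server running, it would remove the need to drive the display from one of the GTX 480 cards in the first place.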
Below is my hardware configuration for the server.
Tyan S7025WAGM2NR motherboard with onboard ASPEED AST2050 for display
Intel Xeon E5620 CPU x 2
Kingston KVR1333D3E9S/4G RAM x 3 (12 GB total)
Hitachi 2 TB 7200 rpm hard drive x 2 (4 TB total)
EVGA GTX 480 graphics card x 4
Silverstone 1,500 W PSU
Cooler Master HAF-X tower case
The operating system is Scientific Linux 5.5 with kernel version 2.6.18-194.26.1.el5. The CUDA version is 3.2 (the final release, not 3.2 RC).