Air cooling and thus thermal control at that altitude could indeed be an issue as air density is quite a bit lower (0.74 kg/m³ at 5000 m vs 1.2 kg/m³ at sea level), reducing the efficiency of any fan-based cooling. However, this may be offset by lower ambient temperature depending on the exact environmental circumstances of the deployment.
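The density figures above can be reproduced with the standard-atmosphere barometric formula; here is a small sketch (the constants are the usual International Standard Atmosphere troposphere values, an assumption on my part, not something specific to the Atacama sites):

```python
import math

def air_density(h_m,
                rho0=1.225,    # sea-level density, kg/m^3 (ISA)
                T0=288.15,     # sea-level temperature, K (ISA)
                L=0.0065,      # temperature lapse rate, K/m
                g=9.80665,     # gravitational acceleration, m/s^2
                M=0.0289644,   # molar mass of dry air, kg/mol
                R=8.31447):    # universal gas constant, J/(mol*K)
    """Approximate dry-air density at altitude h_m (metres) using the
    ISA troposphere model (valid up to ~11 km)."""
    exponent = g * M / (R * L) - 1.0
    return rho0 * (1.0 - L * h_m / T0) ** exponent

print(round(air_density(0), 2))     # ~1.23 kg/m^3 at sea level
print(round(air_density(5000), 2))  # ~0.74 kg/m^3 at 5000 m
```

Note this ignores local temperature and humidity, which is why actual deployment conditions can shift the numbers a bit.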
I think it would be best to get this question directly in front of relevant NVIDIA engineers. If your organization has a dedicated contact at NVIDIA (the context of telescopes located in the Atacama desert suggests that you work for such an organization) I would suggest following up on that path.
Another item besides cooling that may be of concern when operating GPUs at such altitudes is the higher incidence of cosmic rays, which causes increased error rates in DRAM; this suggests that only GPUs with ECC should be used to ensure reliable operation. I do not have any data for FIT rates at 5000 m vs sea level, though.
[Later:] This recent paper (https://www.cs.virginia.edu/~gurumurthi/papers/asplos15.pdf) gives details of the FIT rates for two systems with comparable memory subsystems in Berkeley, CA and Los Alamos, NM. However, the altitude of Los Alamos is only about 7,300 ft, less than half the altitude of the telescope sites in the Atacama desert.
Thanks njuffa! My current institute is involved in the engineering of these Atacama telescopes, but only on a small scale. Our other projects don’t use that many GPUs either, so we have no dedicated NVIDIA engineer contact. Anyway, I will try to ask NVIDIA via some route.
Ambient temperature swings are greater (below freezing at night, beach temperatures in the daytime), but the GPU hardware test installation is indoors in an industrial container, so the swings smooth out a bit. We may have air conditioning (depending on existing facilities and container space).
The FIT rate paper is interesting, thanks! In our application we just process digitized signals without conditional branching, and errors in the data are not very critical. Errors in instructions would be critical, though. I guess it depends on the GPU microarchitecture: are CUDA kernel instructions stored in main external RAM, or in smaller dedicated on-chip memory (and if so, is that memory more resilient to SEUs)?
Neutron flux should increase roughly exponentially with altitude, I think. I found that JEDEC standard JESD89A refers to an online calculator (http://www.seutest.com/): the neutron flux at the Los Alamos altitude of 7,300 ft is 6 times that of sea-level New York, while at 5000 meters it is 36 times sea level. So with ~15 FIT per month per DRAM at Los Alamos, scaling by the flux ratio (36/6 = 6), we should perhaps expect ~90 FIT at the Atacama 5000 m site :P
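The scaling above can be written out explicitly; this is only a back-of-the-envelope estimate that assumes the fault rate scales linearly with relative neutron flux (the flux values are the calculator figures quoted above):

```python
# Relative neutron flux, normalized to sea level (values from the
# seutest.com calculator referenced by JESD89A, as quoted in the post).
flux_los_alamos = 6     # at ~7,300 ft
flux_atacama = 36       # at ~5000 m

# Measured fault rate from the ASPLOS'15 paper's Los Alamos system.
fit_los_alamos = 15     # ~15 faults per month per DRAM device

# Assume fault rate scales linearly with neutron flux.
fit_atacama = fit_los_alamos * flux_atacama / flux_los_alamos
print(fit_atacama)  # 90.0
```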
CUDA programs are stored in the global memory of the GPU, that is, the on-board memory where all other data is held as well. There is an instruction cache in each SM which is fairly small, I think 4KB or maybe 8KB. Given that each GPU machine instruction comprises 8 bytes, that does not represent a whole lot of instructions, and the caches are typically just large enough to hold inner loops, not the entire kernel (unless the kernel is trivial of course).
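To put the cache capacity in perspective, a quick calculation (the 8 KB figure is the upper estimate from above, not a confirmed number):

```python
ICACHE_BYTES = 8 * 1024   # assumed per-SM instruction cache size (8 KB)
INSTR_BYTES = 8           # one GPU machine instruction = 8 bytes

# Maximum number of machine instructions resident in the cache at once.
print(ICACHE_BYTES // INSTR_BYTES)  # 1024
```

So even at the larger estimate, only about a thousand instructions fit, which is why typically just the inner loops stay cached.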
I am not really familiar with how modern GPUs protect data held in on-chip structures, I think there is some protection for caches, TLBs, and register files, possibly in the form of parity bits, but I do not want to speculate. This would be a good question to ask NVIDIA for an authoritative answer.
I think it would be interesting for others here if you could briefly share your experience once you have these GPU-based systems running successfully, I seem to recall two or three other queries in years past asking about operating GPUs at significant elevations. I assume you are already well aware of potential issues with other computer components at high altitudes, such as hard disks.
Considering the “slight” price difference between Tesla and GeForce, unless a Tesla feature other than ECC is needed, one could instead run the same computation twice on GeForce cards (ideally concurrently on different boards); e.g., a GTX 980 costs ~7-8x less than a K40 and will likely be faster (and run cooler).
Unless the data processing provides a built-in consistency check, running twice would indicate to the user that an error has occurred, but not which of the two runs produced it. A two-out-of-three voting scheme would seem more robust.
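A minimal sketch of such a voter, assuming the three results come from three independent runs (the `close` comparator is a hypothetical hook for tolerance-based comparison when the algorithm is not bitwise deterministic):

```python
def vote2of3(a, b, c, close=lambda x, y: x == y):
    """Return a result that at least two of three runs agree on,
    or None if all three disagree (rerun or flag in that case)."""
    if close(a, b) or close(a, c):
        return a
    if close(b, c):
        return b
    return None

# Hypothetical usage: same computation on three boards, then vote.
results = [42.0, 42.0, 41.0]   # pretend one run suffered a bit flip
print(vote2of3(*results))      # 42.0
```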
Good point, thanks for the correction: after a mismatch in results (which, BTW, can be hard to detect without deterministic algorithms), a third run will indeed be needed. One can get lucky and be able to rule out incorrect results if the algorithm is such that errors propagate and a bit flip more often causes Inf/NaN than not, but of course this won’t always be the case.