"Overclocking unfriendly" code

I discovered that certain sequences of device instructions greatly reduce a program’s overclocking potential (by 50–100 MHz), even though the card the program runs on is far from being power-throttled or thermal-throttled, and programs with comparable or heavier arithmetic loads can safely overclock further.

Specifically, the program randomly throws Unspecified Launch Failure errors and occasionally Illegal Memory Access errors, without triggering the WDDM TDR.
The issue can sometimes be fixed by manually reordering instructions and/or changing branching conditions without affecting the underlying logic, or it can be dependably fixed by reducing device code optimization to -Xptxas -O1 or below, at the cost of performance.

The issue can be reliably reproduced on all of my Maxwell cards (one TITAN X and three 980 Tis, all ref cards).

If anyone is interested, I’ve whipped up a (short but not minimal) sample VS2013 project with dummy input data that you can use to try to reproduce the issue.

#temporarily removed#

All my cards can overclock stably at +300 MHz without adjusting voltage, even on the verge of being power-throttled (110% TDP) or thermal-throttled (91 degrees Celsius), and while running at ~97% arithmetic load over a long period (~24 hrs). For this sample, however, I have to drop the overclock to ~+200 MHz for it to finish without errors being thrown.

Understandably overclocking stability is a fringe topic here, but I think it’s an interesting issue to look at nonetheless.

Note that I have only tested this with CUDA 7.5 RC and sm_50/sm_52 code generation, as the code needs the LOP3.LUT instruction exposed in the PTX instruction set. You’ll need a Maxwell card to run it.
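For anyone unfamiliar with LOP3.LUT: it computes an arbitrary 3-input bitwise function of a, b and c, selected by an 8-bit immediate truth table. Here is a small host-side C sketch of its semantics (an emulation for illustration, not my actual kernel code); the device form would be a single lop3.b32 instruction, as noted in the comment.

```c
#include <stdint.h>

/* Emulation of Maxwell's LOP3.LUT (exposed as lop3.b32 in PTX since
 * CUDA 7.5): for each bit position, the three input bits form a 3-bit
 * selector into the 8-bit truth table 'lut'.  On the device the same
 * operation is a single instruction, e.g. for a ^ b ^ c:
 *   asm("lop3.b32 %0, %1, %2, %3, 0x96;" : "=r"(r) : "r"(a), "r"(b), "r"(c));
 * The immediate is conventionally derived by evaluating the desired
 * expression on the constants 0xF0 (a), 0xCC (b), 0xAA (c);
 * e.g. 0xF0 ^ 0xCC ^ 0xAA == 0x96. */
static uint32_t lop3_emulate(uint32_t a, uint32_t b, uint32_t c, uint8_t lut)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; ++i) {
        unsigned sel = (((a >> i) & 1u) << 2)   /* bit of a -> selector bit 2 */
                     | (((b >> i) & 1u) << 1)   /* bit of b -> selector bit 1 */
                     |  ((c >> i) & 1u);        /* bit of c -> selector bit 0 */
        r |= (uint32_t)((lut >> sel) & 1u) << i;
    }
    return r;
}
```

This is handy for checking the truth-table immediates used in hand-written PTX against a known-good reference.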

I am using Windows 8.1 64-bit with GeForce driver version 353.38. The program should target the 64-bit platform.

Interesting, thanks for posting a tasty bit of sample code.
So this can only run on 7.5 RC because of the LOP3.LUT instruction?

Generally I buy the factory-overclocked versions of cards with the ACX coolers and leave them at that level, so I’ve never had such an issue.

To be fair that is some serious overclocking, but I am glad someone is pushing the limits.

Will download 7.5 RC today and give the code sample a go…

Yes, LOP3.LUT was exposed as lop3.b32 in PTX in CUDA 7.5 RC. I might have isolated the issue; I’ll upload the latest test code later (I had actually created a buggy test case).

The signal paths in any processor are not all created equal. Some are tighter than others. Early on in the life of a new device the vendor characterizes the silicon to find the slowest paths, adds an engineering margin and bases the frequency and other specifications of the reference device on that. Occasionally it happens that some speed path is missed in initial analysis, and part specifications have to be adjusted accordingly (e.g. slightly higher voltage or reduced operating temperature) to guarantee proper device function at the specified parameters.

What is the purpose of the engineering margin? There are two major reasons: The silicon manufacturing process, consisting of numerous chemical and mechanical steps, has some variability. With feature sizes that can be measured in atom lengths, it is easy to be off by a few atom lengths in one direction or the other. This means there will be variability between speed paths across individual parts of the same model. The second reason is that semiconductor devices slow down as they age. Wires may thin through electromigration, increasing resistance and slowing signal propagation, transistor switching speed may slow down due to trapped charge. The hotter a part runs the faster aging occurs (see Arrhenius equation). The engineering margin applied is a function of anticipated manufacturing variability and anticipated service life of the component, say 5 years for a processor.
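The temperature dependence of aging mentioned above is commonly modeled with an Arrhenius acceleration factor; one standard form (with an assumed activation energy $E_a$ for the failure mechanism) is:

```latex
AF = \exp\!\left[\frac{E_a}{k_B}\left(\frac{1}{T_{\mathrm{use}}} - \frac{1}{T_{\mathrm{stress}}}\right)\right]
```

where $k_B$ is Boltzmann’s constant and the temperatures are absolute; running hotter ($T_{\mathrm{stress}} > T_{\mathrm{use}}$) gives $AF > 1$, i.e. proportionally faster aging.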

Overclocking eats into the frequency margin designed into vendor specifications, and since it increases current and often temperature as well, leads to faster aging of the semiconductor device. Since the speed paths are not publicly documented, any number of operations or sequences of operations can hit such a speed path and cause software to fail. The failure may be obvious (crash, blue screen, kernel panic) or very subtle and therefore may go unnoticed for a long time.

Back when I was a poor student, I needed performance but could not afford top-of-the-line hardware, so I became an avid overclocker of CPUs. Of course I would “stress test” my overclocked parts to make sure everything worked properly despite running in excess of manufacturer specifications. Several years into this, I found strange deviations in a floating-point-intensive simulation code. It took me forever to track this down to an FSQRT instruction that occasionally delivered results with a few bits flipped. Further testing showed these failures to be operand dependent, but clearly due to overclocking. Ever since then, I have taken an extremely critical view of overclocking, whether by end users or via “factory overclocked” hardware. The aggravation of having an application fail in subtle ways that may go unnoticed for months is not worth it, IMHO. Other people take a different stance, observing that failures (both severe and subtle) are more likely to originate in software than in hardware.

To summarize: speed paths in processors depend at minimum on voltage, temperature, processor age, instructions or instruction sequences, instruction operands, and noise from manufacturing tolerances. There is no surefire way for an end user to know when overclocking is completely safe, nor is it generally possible to incorporate knowledge about speed paths into a compiler’s code generation (e.g. tweaks to the manufacturing process over the lifetime of a chip can change them). In my experience, compiler code generation is primarily oriented towards performance under the “race-to-finish” model, as measured for example in clock cycles, and secondarily towards reducing power consumption, the second goal being an addition of the past decade.

I have identified the lines of code that cause the issue and restructured the code to make the program more stable. Sadly, converting it to a short reproducible sample seems non-trivial. And while I still don’t know the exact cause, this will have to do for now.

njuffa, thank you very much for posting this and sharing your experience. It explains a great deal about processor manufacturing and the caveats of overclocking. Operand-dependent failures are exactly what I experienced, and they took me a long time to track down as well. In my case, I found that simply using a different salt value could render my DES crypt(3) kernel extremely unstable when the device is overclocked. While this project can tolerate data errors, I would definitely think twice now before using overclocked hardware for computation tasks whose results aren’t easy to verify.

I am now fairly certain that excessive register spilling (>~60 bytes) coupled with high compute throughput is one of the key ingredients in triggering the issue. Still, the actual failure could very well be operand dependent. I have not experienced the issue even once when there is no or little register spilling. While, as njuffa said, it’s not generally possible for the compiler to optimize code for a specific piece of silicon, the programmer may be able to design the algorithm so that routines or code sequences that are more sensitive to overclocking, once identified, are absent from the actual program. In my case, I restructured my code for lower occupancy to reduce register pressure, and used branching to replace arithmetic instructions with data movement, reducing the arithmetic load.

Regardless, I’ve found Maxwell (or at least GM200) to be a great overclocking platform. It’s also possible to use overclocking to recover throughput lost to stall reasons other than a busy arithmetic pipe (such as instruction fetch, execution dependency or synchronization) in a low-arithmetic kernel; such a kernel runs cooler, so one can reach a much higher clock than is normally possible with a high-arithmetic kernel, provided the card itself isn’t throttled by power, temperature or voltage. I was able to do something like this at stock voltage with the modified algorithm on one of my cards:
That’s 30% extra performance! Despite the lower arithmetic load, the program is now faster and more stable than my old one.

Finally, one cannot stress enough the downsides of overclocking, namely reduced lifespan and program failures, both catastrophic and silent. One could argue, and I quietly agree, that the reduced lifespan is adequately commensurate with the depreciation of the device (due to technological advancement) if done right. Silent program failures can be eliminated in algorithms where result verification is much cheaper than result generation. Brute-force password cracking (simply check the candidate against the hash) and optimization (simply evaluate the objective function) come to mind. On the other hand, simulation, linear algebra, spatial partitioning data structures, etc. are very prone to data errors and thus sensitive to overclocking. I believe that, used with prudence and proper judgement, overclocking can be one of the most effortless ways of further accelerating a program.
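To make the "cheap verification" idea concrete, here is a toy C sketch (not my DES crypt(3) kernel; the hash and search space are stand-ins): the expensive step is the brute-force preimage search the GPU would perform, and the cheap step re-hashes the returned candidate once on the host, which catches silent errors from unreliable hardware.

```c
#include <stdint.h>

/* Toy 32-bit FNV-1a hash over the 4 bytes of x (stand-in for a real
 * password hash such as DES crypt(3)). */
static uint32_t fnv1a(uint32_t x)
{
    uint32_t h = 2166136261u;
    for (int i = 0; i < 4; ++i) {
        h ^= (x >> (8 * i)) & 0xFFu;
        h *= 16777619u;
    }
    return h;
}

/* Expensive step: brute-force preimage search (what the overclocked
 * device would do).  Returns 1 and stores the match if one is found. */
static int search(uint32_t target, uint32_t limit, uint32_t *out)
{
    for (uint32_t x = 0; x < limit; ++x) {
        if (fnv1a(x) == target) { *out = x; return 1; }
    }
    return 0;
}

/* Cheap step: verify a candidate with a single hash evaluation, so a
 * silently corrupted result is rejected rather than trusted. */
static int verify(uint32_t target, uint32_t candidate)
{
    return fnv1a(candidate) == target;
}
```

The asymmetry is the whole point: generation is O(N) hash evaluations, verification is O(1), so re-checking every device result on the host costs essentially nothing.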

I did some more testing and it seems this issue is related to https://devtalk.nvidia.com/default/topic/390313/unspecified-launch-failure-from-quot-volatile-quot-adding-quot-volatile-quot-causes-random-ulf/

I’ve now forgone the usage of volatile in favor of inline PTX to tell the compiler exactly what to do.
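In CUDA the replacement for volatile is device inline PTX, e.g. asm("lop3.b32 %0, %1, %2, %3, 0x96;" : "=r"(r) : "r"(a), "r"(b), "r"(c)); which pins down exactly one instruction. A portable host-side sketch of the same general technique (GCC/Clang extended asm; the function names here are illustrative): an empty asm with a register constraint forces the compiler to materialize the value at that point and blocks folding and reordering across it, without declaring any memory volatile.

```c
#include <stdint.h>

/* An empty extended-asm statement with a "+r" constraint: no instruction
 * is emitted, but the compiler must hold 'x' in a register here and may
 * not fold, hoist, or reorder computations across this point. */
static uint32_t opaque(uint32_t x)
{
    __asm__ volatile("" : "+r"(x));
    return x;
}

/* Example: forces (a + b) to be computed as written before the multiply,
 * instead of letting the optimizer re-associate the whole expression. */
static uint32_t demo(uint32_t a, uint32_t b, uint32_t c)
{
    return opaque(a + b) * c;
}
```

Unlike volatile, which changes the semantics of every access to an object, this pins down only the one computation you care about.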

I predict you will soon be using scottgray’s Maxwell SASS assembler :-)

Thank you for mentioning that; I didn’t even know it existed.