Experiences with EVGA GTX TITAN Superclocked - memtestG80 - UNDERclocking in Linux?

mar7mal · May 27, 2013, 6:27pm

Dear all,

I have recently bought two “EVGA GTX TITAN Superclocked” GPUs for scientific
calculations.

My first calculations (pmemd.cuda in Amber12) on systems of around 60K atoms ran without any problems (NPT, Langevin), but when I later tried bigger systems (around 100K atoms) I got the “classic” irritating error

cudaMemcpy GpuBuffer::Download failed unspecified launch failure

after just a few thousand MD steps.

This is what prompted me to run memtestG80
(SimTK: MemtestG80 and MemtestCL: Memory Testers for CUDA- and OpenCL-enabled GPUs: Project Home).

I compiled memtestG80 from source ( memtestG80-1.1-src.tar.gz ) and then tested just a small part of the GPU memory (200 MiB) for 100 iterations.

On both cards I got a huge number of errors, but “just” in the
“Random blocks” test. All the remaining tests reported 0 errors in every iteration.

------THE LAST ITERATION AND FINAL RESULTS-------

Test iteration 100 (GPU 0, 200 MiB): 169736847 errors so far
Moving Inversions (ones and zeros): 0 errors (6 ms)
Memtest86 Walking 8-bit: 0 errors (53 ms)
True Walking zeros (8-bit): 0 errors (26 ms)
True Walking ones (8-bit): 0 errors (26 ms)
Moving Inversions (random): 0 errors (6 ms)
Memtest86 Walking zeros (32-bit): 0 errors (105 ms)
Memtest86 Walking ones (32-bit): 0 errors (104 ms)
Random blocks: 1369863 errors (27 ms)
Memtest86 Modulo-20: 0 errors (215 ms)
Logic (one iteration): 0 errors (4 ms)
Logic (4 iterations): 0 errors (8 ms)
Logic (shared memory, one iteration): 0 errors (8 ms)
Logic (shared-memory, 4 iterations): 0 errors (25 ms)

Final error count after 100 iterations over 200 MiB of GPU memory: 171106710 errors
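
As a quick sanity check on the log above, the running total shown at the start of iteration 100 plus the errors reported during that iteration should equal the final count. A minimal Python sketch, with the three numbers copied from the output (the variable names are only illustrative):

```python
# Numbers copied from the memtestG80 output above.
errors_before_last = 169_736_847   # "errors so far" at the start of iteration 100
random_blocks_last = 1_369_863     # "Random blocks" errors in iteration 100
final_reported = 171_106_710       # final error count after 100 iterations

# Every other test reported 0 errors in iteration 100, so the running total
# plus the Random blocks errors should match the final count.
total = errors_before_last + random_blocks_last
print(total == final_reported)  # prints True

# Average errors per iteration, for comparison with the last iteration.
print(final_reported / 100)
```

The totals agree, which is consistent with every counted error coming from the Random blocks test alone.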

I have some questions and would be really grateful for any comments.

Regarding overclocking: using deviceQuery I found that under Linux both cards automatically run at the boost shader/GPU frequency, which here is 928 MHz (the base value for these factory-OC cards is 876 MHz). deviceQuery reports a Memory Clock rate of 3004 MHz although it “should” be 6008 MHz, but maybe the “Memory Clock rate” reported by deviceQuery is a different quantity from the “Memory Clock” in the product specification. It seems that “Memory Clock rate” = “Memory Clock”/2. Am I right? Or is deviceQuery simply unable to read this value properly on the Titan?
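
One way to check which reading fits the published specification is to compute the peak memory bandwidth from the deviceQuery value. The sketch below assumes a 384-bit memory bus and two transfers per reported clock; both are assumptions taken from the card's published specification and the usual double-data-rate convention, not from the deviceQuery output quoted here.

```python
def peak_bandwidth_gb_s(mem_clock_mhz, bus_width_bits, transfers_per_clock=2):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return mem_clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9

reported_clock_mhz = 3004   # "Memory Clock rate" from deviceQuery
assumed_bus_bits = 384      # assumption: bus width from the published Titan spec

print(round(peak_bandwidth_gb_s(reported_clock_mhz, assumed_bus_bits), 1))
```

Under these assumptions the result is about 288.4 GB/s, which would make 3004 MHz and 6008 MHz the same setting expressed once as a clock and once as an effective data rate.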

Anyway, for the moment I assume that the problem might be due to the high shader/GPU frequency (see here: http://folding.stanford.edu/English/DownloadUtils ).

To verify this hypothesis, one should perhaps UNDERclock to the base frequency of this model, which is 876 MHz, or even to the TITAN REFERENCE frequency, which is 837 MHz.
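
To put those targets in perspective, here is a small Python sketch of how far below the observed boost clock each one lies. The three frequencies are the ones quoted above; the helper name is only illustrative.

```python
boost_mhz = 928          # boost clock reported by deviceQuery
factory_base_mhz = 876   # base clock of the Superclocked model
reference_mhz = 837      # reference TITAN base clock

def reduction_percent(from_mhz, to_mhz):
    """Relative clock reduction, in percent of the starting frequency."""
    return 100.0 * (from_mhz - to_mhz) / from_mhz

for label, target in (("factory base", factory_base_mhz),
                      ("reference", reference_mhz)):
    print(f"{label}: {reduction_percent(boost_mhz, target):.1f}% below boost")
```

So the test would mean giving up roughly 5.6% or 9.8% of the core clock, respectively.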

I am working with these cards under Linux (CentOS, kernel 2.6.32-358.6.1.el6.x86_64), and as far as I can tell the OC tools under Linux are limited to the NVclock utility, which is unfortunately out of date (at least as far as the GTX Titan is concerned). This is what I got when I simply asked NVclock to read and print the shader and memory frequencies of my Titans:

[root@dyn-138-272 NVCLOCK]# nvclock -s --speeds
Card: Unknown Nvidia card
Card number: 1
Memory clock: -2147483.750 MHz
GPU clock: -2147483.750 MHz

Card: Unknown Nvidia card
Card number: 2
Memory clock: -2147483.750 MHz
GPU clock: -2147483.750 MHz
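
The printed value is suspicious in itself: -2147483.750 is what results when the smallest signed 32-bit integer is divided by 1000 and stored as a single-precision float. One plausible reading (an assumption, not something checked against the NVclock source) is that the tool picks up an invalid sentinel value instead of a real clock reading on this unsupported card. A small Python sketch of the arithmetic:

```python
import struct

INT32_MIN = -2**31  # smallest signed 32-bit integer

def as_float32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Assumption: a raw value interpreted as kHz and converted to MHz in float32.
value = as_float32(INT32_MIN / 1000.0)
print(f"{value:.3f} MHz")  # prints -2147483.750 MHz
```

Either way, the numbers are clearly not real clock readings, so NVclock cannot be used to inspect these cards.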

I would be really grateful for tips on “NVclock alternatives”,
but after wasting some hours googling, it seems there is no other Linux
tool with NVclock's functionality. So perhaps the only option is to edit the
GPU BIOS with Linux/DOS/Windows tools such as Kepler BIOS Tweaker or NVflash. I would rather avoid that approach, since it probably voids the warranty even if I only underclock the GPUs rather than overclock them. Am I right? So before taking that step (GPU BIOS editing), I would like a rough estimate of the probability that the problems really are caused by the overclocking (a default boost shader frequency that is too high).

I hope to estimate this probability from the responses of other
Titan SC users, assuming I am not the only crazy guy who bought this model
for scientific calculations :)) Any experience with
Titan cards and their memtestG80 results (also in connection with
a possible warranty claim), or with UNDER/OVERclocking them (if possible under Linux), is of course welcome as well! If any NVIDIA expert/developer reads this post, I would be grateful for a “recommended/standard solution” to this unpleasant situation.

My HW/SW configuration

motherboard: ASUS P9X79 PRO
CPU: Intel Core i7-3930K
RAM: CRUCIAL Ballistix Sport 32GB (4x8GB) DDR3 1600 VLP
CASE: CoolerMaster Dominator CM-690 II Advanced
Power: Enermax PLATIMAX EPM1200EWT 1200W, 80+, Platinum
GPUs : 2 x EVGA GTX TITAN Superclocked 6GB
cooler: Cooler Master Hyper 412 SLIM

OS: CentOS (2.6.32-358.6.1.el6.x86_64)
driver version: 319.17
cudatoolkit_5.0.35_linux_64_rhel6.x

The computer is in an air-conditioned room with a constant ambient temperature of around 18°C.

Thanks a lot in advance for any comments/experiences!

Best wishes,

Marek

mar7mal · May 28, 2013, 1:20am

Hi again,

Thanks to one valuable response on the EVGA forum ( http://www.evga.com/forums/tm.aspx?m=1940998 ),
I finally learned that besides the “CLOSED” variants of memtestG80/memtestCL, which are
available here: SimTK: MemtestG80 and MemtestCL: Memory Testers for CUDA- and OpenCL-enabled GPUs: Downloads, there are also
“OPEN” variants, which seem to be more up to date and differ from the “CLOSED” variants
in at least one thing: a sync error in the Random blocks test was fixed :))
Here are the source links for the OPEN versions:

memtestG80

Here is the commit with the sync fix:
https://github.com/ihaque/memtestG80/commit/c4336a69fff07945c322d6c7fc40b0b12341cc4c

memtestCL

and the commit with its fix:
https://github.com/ihaque/memtestCL/commit/a7f25002cde6dc396a09870ec8b468cd9e3bd5ff

When I tested my factory-OC TITANs with the patched (OPEN) version of memtestG80, I got
0 errors!!!

So far (just a few minutes ago) I have tested 5 GB of memory for 300 iterations
( ./memtestG80 -g 1 5000 300 ) with zero errors (on both GPUs).

So it seems that my original problem with that particular MD calculation, which prompted me to test my new OC Titan cards with memtestG80, does not originate in GPU hard/soft errors.