Could anyone benchmark this for me on a 780 (Ti) or Titan?

My guess is that a stock 780 Ti could do 400-450 kHash/s, simply by extrapolating from the performance of my GT 640 (GK208 chip), but due to a lack of testers I haven't been able to confirm this yet.

Find this program (Windows binary + source code) here:

The Linux source code is best taken from GitHub directly:

I have kernels for all major CUDA architectures inside. It gets harder and harder to optimize them further. nVidia cards still lag behind AMD cards in terms of performance, but I've been able to close the gap somewhat ;) The high-end AMD cards push numbers in the 800 kHash/s range.

The required command line to launch this for benchmarking:

cudaminer --benchmark

It will first auto-tune to find a suitable kernel launch configuration (this may take some time), and then it reports some kHash/s values in the console.

Kudos for the easy to run benchmark!

It’s still running on my GTX Titan; it reported 393 kHash/s.

Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\hpc\Downloads\cudaminer-2013-11-20\cudaminer-2013-11-20>cudaminer.exe --benchmark
           *** CudaMiner for nVidia GPUs by Christian Buchner ***
                     This is version 2013-11-20 (alpha)
        based on pooler-cpuminer 2.3.2 (c) 2010 Jeff Garzik, 2012 pooler
               Cuda additions Copyright 2013 Christian Buchner
           My donation address: LKS1WDKGED647msBQfLBHV3Ls8sveGncnm

[2013-11-20 12:03:30] 1 miner threads started, using 'scrypt' algorithm.
[2013-11-20 12:03:30] Binding thread 0 to cpu 0
[2013-11-20 12:03:49] GPU #0: GeForce GTX TITAN with compute capability 3.5
[2013-11-20 12:03:49] GPU #0: interactive: 0, tex-cache: 0 , single-alloc: 0
[2013-11-20 12:03:50] GPU #0: Performing auto-tuning (Patience...)
[2013-11-20 12:07:02] GPU #0:  393.34 khash/s with configuration T28x18
[2013-11-20 12:07:02] GPU #0: using launch configuration T28x18
[2013-11-20 12:07:02] GPU #0: GeForce GTX TITAN, 16128 hashes, 0.08 khash/s
[2013-11-20 12:07:02] Total: 0.08 khash/s
[2013-11-20 12:07:02] GPU #0: GeForce GTX TITAN, 16128 hashes, 161.27 khash/s
[2013-11-20 12:07:02] Total: 161.27 khash/s
[2013-11-20 12:07:04] GPU #0: GeForce GTX TITAN, 806400 hashes, 310.85 khash/s
[2013-11-20 12:07:04] Total: 310.85 khash/s
[2013-11-20 12:07:09] GPU #0: GeForce GTX TITAN, 1564416 hashes, 313.93 khash/s
[2013-11-20 12:07:09] Total: 313.93 khash/s
[2013-11-20 12:07:14] GPU #0: GeForce GTX TITAN, 1580544 hashes, 313.27 khash/s
[2013-11-20 12:07:14] Total: 313.27 khash/s
[2013-11-20 12:07:19] GPU #0: GeForce GTX TITAN, 1580544 hashes, 314.14 khash/s
[2013-11-20 12:07:19] Total: 314.14 khash/s
[2013-11-20 12:07:25] GPU #0: GeForce GTX TITAN, 1580544 hashes, 312.84 khash/s
[2013-11-20 12:07:25] Total: 312.84 khash/s
[2013-11-20 12:07:30] GPU #0: GeForce GTX TITAN, 1564416 hashes, 314.25 khash/s
[2013-11-20 12:07:30] Total: 314.25 khash/s
[2013-11-20 12:07:35] GPU #0: GeForce GTX TITAN, 1580544 hashes, 309.89 khash/s
[2013-11-20 12:07:35] Total: 309.89 khash/s
[2013-11-20 12:07:40] GPU #0: GeForce GTX TITAN, 1564416 hashes, 314.44 khash/s
[2013-11-20 12:07:40] Total: 314.44 khash/s
[2013-11-20 12:07:45] GPU #0: GeForce GTX TITAN, 1580544 hashes, 313.89 khash/s
[2013-11-20 12:07:45] Total: 313.89 khash/s
[2013-11-20 12:07:50] GPU #0: GeForce GTX TITAN, 1580544 hashes, 312.59 khash/s
[2013-11-20 12:07:50] Total: 312.59 khash/s
[2013-11-20 12:07:55] GPU #0: GeForce GTX TITAN, 1564416 hashes, 313.93 khash/s
[2013-11-20 12:07:55] Total: 313.93 khash/s
[2013-11-20 12:08:00] GPU #0: GeForce GTX TITAN, 1580544 hashes, 311.36 khash/s
[2013-11-20 12:08:00] Total: 311.36 khash/s
[2013-11-20 12:08:05] GPU #0: GeForce GTX TITAN, 1564416 hashes, 313.24 khash/s
[2013-11-20 12:08:05] Total: 313.24 khash/s
[2013-11-20 12:08:10] GPU #0: GeForce GTX TITAN, 1580544 hashes, 314.08 khash/s
[2013-11-20 12:08:10] Total: 314.08 khash/s
[2013-11-20 12:08:15] GPU #0: GeForce GTX TITAN, 1580544 hashes, 313.09 khash/s
[2013-11-20 12:08:15] Total: 313.09 khash/s
[2013-11-20 12:08:20] GPU #0: GeForce GTX TITAN, 1580544 hashes, 313.96 khash/s
[2013-11-20 12:08:20] Total: 313.96 khash/s
[2013-11-20 12:08:25] GPU #0: GeForce GTX TITAN, 1580544 hashes, 314.21 khash/s
[2013-11-20 12:08:25] Total: 314.21 khash/s
[2013-11-20 12:08:30] GPU #0: GeForce GTX TITAN, 1580544 hashes, 309.65 khash/s
[2013-11-20 12:08:30] Total: 309.65 khash/s
[2013-11-20 12:08:35] GPU #0: GeForce GTX TITAN, 1548288 hashes, 312.64 khash/s
[2013-11-20 12:08:35] Total: 312.64 khash/s
[2013-11-20 12:08:40] GPU #0: GeForce GTX TITAN, 1564416 hashes, 313.49 khash/s
[2013-11-20 12:08:40] Total: 313.49 khash/s
[2013-11-20 12:08:45] GPU #0: GeForce GTX TITAN, 1580544 hashes, 314.52 khash/s
[2013-11-20 12:08:45] Total: 314.52 khash/s
[2013-11-20 12:08:50] GPU #0: GeForce GTX TITAN, 1580544 hashes, 313.33 khash/s
[2013-11-20 12:08:50] Total: 313.33 khash/s
[2013-11-20 12:08:55] GPU #0: GeForce GTX TITAN, 1580544 hashes, 312.65 khash/s
[2013-11-20 12:08:55] Total: 312.65 khash/s
[2013-11-20 12:09:00] GPU #0: GeForce GTX TITAN, 1564416 hashes, 313.56 khash/s
[2013-11-20 12:09:00] Total: 313.56 khash/s
[2013-11-20 12:09:05] GPU #0: GeForce GTX TITAN, 1580544 hashes, 314.14 khash/s
[2013-11-20 12:09:05] Total: 314.14 khash/s
[2013-11-20 12:09:10] GPU #0: GeForce GTX TITAN, 1580544 hashes, 313.77 khash/s
[2013-11-20 12:09:10] Total: 313.77 khash/s
[2013-11-20 12:09:15] GPU #0: GeForce GTX TITAN, 1580544 hashes, 314.08 khash/s
[2013-11-20 12:09:15] Total: 314.08 khash/s
[2013-11-20 12:09:20] GPU #0: GeForce GTX TITAN, 1580544 hashes, 313.71 khash/s
[2013-11-20 12:09:20] Total: 313.71 khash/s
[2013-11-20 12:09:25] GPU #0: GeForce GTX TITAN, 1580544 hashes, 309.23 khash/s

Btw, I also have a K20 but I doubt it would improve performance further.

I don’t know much about mining but this seems like pretty good performance for NV GPUs :-)

The results reported by the auto-tuning procedure and the actual results achieved during mining differ quite substantially: in the end the program only achieves about 313 kHash/s.

I have yet to figure out why autotune’s results are too optimistic.

Thanks for running the tests!

No problem, will be happy to run more tests if you do further code updates. Make sure to PM me as I don’t check the forums as often as I used to. :-)

I found the Titan to work best using “-l K14x24 -C 1” on the command line. But I haven’t played around with it too much. Also, enabling DP makes it run hotter and slower.

At stock 1006/6000 (boost clocks) I get ~405 kHash/s (360 kHash/s with DP enabled),
and at 1150/6000 (overclocked) I get ~455 kHash/s (420 kHash/s with DP enabled).

Unfortunately, the power draw really ruins any chance I have of running the Titan as a miner. But cudaminer is a great start to making GPU mining more competitive for the green team.

For those doing benchmarks: the flag -H 1 may remove a CPU limitation. The CPU computes SHA-256 hashes before and after CUDA runs the scrypt core kernels, and if -H 1 is not given, this CPU part runs on a single core only - potentially limiting the kHash/s values reported after autotuning. -H 1 enables parallel_for constructs to distribute the workload across all cores.

The benchmark should therefore be called with

cudaminer -H 1 --benchmark
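To illustrate the idea behind -H 1 (not cudaminer's actual code, which uses parallel_for - the helper names and the toy hash below are made up for the sketch): the per-nonce CPU work is independent, so it can be strided across all hardware threads instead of running on one core.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in for the per-nonce SHA-256 work the CPU does around the
// scrypt kernels (a real miner would call an actual SHA-256 here).
static uint32_t toy_hash(uint32_t nonce) {
    uint32_t h = nonce * 2654435761u;  // Knuth multiplicative hash
    h ^= h >> 16;
    return h;
}

// Distribute the nonce range across all hardware threads, in the
// spirit of what -H 1 enables via parallel_for.
std::vector<uint32_t> hash_all(uint32_t count) {
    std::vector<uint32_t> out(count);
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t) {
        workers.emplace_back([&, t] {
            for (uint32_t i = t; i < count; i += n)  // strided partition
                out[i] = toy_hash(i);
        });
    }
    for (auto& w : workers) w.join();
    return out;
}
```

With the single-core default, this loop would simply run with n = 1; the strided partition keeps every core busy without any locking, since each index is written by exactly one thread.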

You can also enable one of two caching options with -C 1 and -C 2 when trying the Kepler kernel. The Titan kernel automatically caches its global memory accesses by means of the __ldg() intrinsic and ignores the -C option.

It is surprising that the Kepler kernel may be fastest on a Titan card, as on my GK208-based card (compute capability 3.5) the Titan kernel wins with a notable performance edge - it makes direct use of the funnel shifter.

For real mining work on Litecoin or other cryptocoins you can also pass the -i 0 flag, which utilizes the card nearly 100%. The default is to leave a millisecond of sleep time between kernels to allow for some display interactivity (assuming the GPU also drives a monitor).

That’s what I was doing: passing the -i 0 flag and running it on my second Titan. I forgot to mention that, so my numbers are with the -i 0 flag.


cudaminer -H 1 --benchmark

I saw some improvement on my machine: ~350 kHash/s.

Thanks for the update, Jimmy.

I ordered a GTX 780 Ti. I feel confident that I can get this beast to output 400 kHash/s minimum. While it’s still not a good investment for cryptocoin mining (ATI cards rule), it certainly is the sexiest CUDA card out there, if you don’t care for double precision arithmetic or 6 GB of memory. And I do a lot of other programming in CUDA and OpenGL, and some gaming as well.

BTW: one litecoin = 7 EUR currently. Sweet. Graphics cards paying for themselves - that is nice.


I rarely work in double precision. I guess the only thing I would miss would be the 6 GB of memory but let’s not get too spoiled. :-)

Btw, I was looking at some die shots of the SMX on GK104 cores and GK110, there really seems to be a significant size difference. I wonder if we’ll continue to see this hardware divergence into compute and gaming.

Probably, as long as GPUs push the die size envelope. A smaller die lowers the defect probability for any given die, reducing losses. Die harvesting by disabling defective SMXs helps some, but I don’t think you can beat having the smallest die possible. As long as compute applications are willing to pay a price premium, they will get the big chips. :)

On the other end of the compute scale, where die area is so small as to not matter much, I find it interesting that thermal cap is the new product differentiator. The iPhone 5s, iPad mini, and iPad all use the same CPU/GPU chip now, and differ in the configuration of their maximum thermal output. Their max clock rates are nearly identical (1.3 GHz for iPhone, 1.4 GHz for both iPads), but the dynamic clock rates vary to stay within the power limitations of the particular device.

With GPU Boost now in Tesla, I think we are solidly in the “constant power, variable clock” era. I fully expect to see the Visual Profiler start rating the energy efficiency of our programs within 3-5 years. :)

The constant-power, variable-clock behavior also played a role in my optimization of the Kepler compute kernel.

I was able to optimize out some redundant arithmetic operations (array-indexing computations) that sat in between memory loads/stores. These did not really slow down execution, because the loads/stores had large latencies. However, the redundant instructions consumed additional power. With them optimized out, the GPU had more headroom before reaching the power limit and hence clocked up higher on average. So I got my speed gain after all.

This optimization benefit was not seen in the Fermi and legacy kernels, because those GPUs do not do the same power-cap-based dynamic clocking.
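The principle can be sketched in plain host code (this is purely illustrative - not the actual scrypt kernel, and the function names are made up): hoist index arithmetic that would otherwise be recomputed around every load/store, so fewer arithmetic instructions sit between the memory operations.

```cpp
#include <cstddef>
#include <cstdint>

// Before: the row offset r * cols is recomputed for every access.
// On a latency-bound kernel this costs little time, but each extra
// instruction still burns power.
void copy_inc_naive(uint32_t* dst, const uint32_t* src,
                    size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            dst[r * cols + c] = src[r * cols + c] + 1;
}

// After: the offset is computed once per row and reused, leaving the
// inner loop with (nearly) only the loads and stores.
void copy_inc_hoisted(uint32_t* dst, const uint32_t* src,
                      size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; ++r) {
        const size_t base = r * cols;  // hoisted index computation
        for (size_t c = 0; c < cols; ++c)
            dst[base + c] = src[base + c] + 1;
    }
}
```

On a CPU a compiler will usually do this hoisting for you; the point of the anecdote above is that on a power-capped Kepler GPU the saved instructions translate into a higher average boost clock rather than fewer cycles.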

Hah, that’s awesome! Now I really want the tools to help me maximize the average clock rate on Kepler…

My GTX 780 Ti reaches 430 kHash/s with cudaminer ;)

Maybe more if I add some overclocking.

That’s great! Now we can add yet another variable to our N-dimensional optimization space :-)

I expect future Nsight and the visual profiler version to profile power consumption during kernel execution and also give me hints about which instructions consume more/less power ;-)


I tried using the Kepler SHFL instruction to replace shared memory, but so far I have not been able to get the hash rates any higher.

But with some overclocking I can now get the 780Ti to output 480 kHash/s.

Which is why I am now building a proof-of-concept Megahash mining rig using all CUDA cards.

2 x GTX 780Ti
1 x GT 640 with GDDR5 RAM

The thing should at least pay for itself, and in the long run maybe even make a bit of profit.
So essentially I will get two high-end cards for free - which are still useful even when the mining craze has subsided.

I expect it to draw 550-600 Watts from the wall, producing 1030 kHash/s. Nearly 2 kHash/s per Watt.

And they said nVidia cards sucked for Litecoin mining.

Mining currently has a HUGE impact on the GPU market. AMD cards selling out everywhere. nVidia to the rescue. ;)

Great job Christian!

It seems the best AMD Litecoin miners are between 1.9-3.16 kHash/s per Watt, so your rig sounds like it will be really competitive!

Is it correct that Litecoin is still profitable to mine on GPUs while Bitcoin no longer is?