Could anyone benchmark this for me on a 780 (Ti) or Titan?

Litecoin mining recently became profitable again after the price jumps visible in this chart (see Nov 18 and Nov 27/28):

Cryptocurrency Prices, Charts, and Crypto Market Cap | CoinGecko

The price jumped from a few dollars to above 30 dollars per coin and has been holding up well so far. A lot of people are now investing in new mining hardware (mostly AMD GPUs). There is no competition from FPGA and ASIC chips yet (and from an engineering standpoint I would say this will last). Designing ASICs with big fat memory pipes is kind of hard.

Bitcoin mining on GPUs hasn’t been profitable for a long time. FPGAs and ASICs killed that idea.

Ah, so Litecoin is actually more of a bandwidth-bound problem. Not exactly a strong point of FPGAs/ASICs.

David Andersen has come up with more efficient CUDA code than I did.

cudaminer gets 150 kHash/s on a Compute 3.0 device (Amazon EC2 instance) while his miner gets 220 kHash/s. EDIT: so far he has disclosed only that it targets Kepler and uses the warp shuffle instruction.

I am currently peaking at 480kHash/s on a GTX 780Ti (with some overclocking), and I have also been able to reduce the CPU load to near zero. With David’s changes these critters could exceed 600 kHash/s, becoming serious rivals to AMD cards!

My dedicated mining rig with 3x 780Ti cards is currently not running because I do not yet have an adequate power supply (the affordable ones all seem to be on back order). For now I run two of the three cards in separate PCs.


David Andersen’s work has boosted the hash rate of cudaminer to 550 kHash/s on my slightly OC’ed 780Ti. I’ve seen some extreme overclockers report 650 kHash/s. That’s way in AMD’s territory, there.

AMD cards are still cheaper to acquire, though. So they will keep their edge over nVidia for litecoin (scrypt based) mining.

The biggest speed gains were for Compute 3.0 devices - a 50% gain in some cases. Ah, the power of the SHFL instruction.

Very cool.

Have you used any of the CUDA profiling tools yet?

Briefly, using the CUDA Visual Profiler from the CUDA 5.0 SDK. But I got confused by all the new counters that have been introduced since I last tried this on Compute 1.x and Compute 2.x devices. I’ve since upgraded to CUDA 5.5, so maybe I should try again. I heard the profiler has been greatly enhanced since.

One worthwhile thing to try would be to replace shfl with shared memory to bring Dave Andersen’s principal design to Fermi and older devices. Maybe there’s still some speed-up to be discovered…


In case someone is interested: you can do serious coin mining with nVidia.

I present my 1.65 MHash/s miner using 3 nVidia GTX 780Ti cards, drawing 850 Watts from the wall and running Kubuntu 12.04. This build is a bit noisy, with two of the three GPUs running at 90% fan speed (airflow needs improvement).

Mainboard: Asrock Z87 Fatal1ty Killer (a gaming mainboard with 3 PCI express x16 slots), CPU: Intel Core i3-4130T, LGA1150, PSU: Aerocool GT-1050S CM 1050W ATX

The mainboard wasn’t cheap, but it may later become the basis for my next desktop PC build.
Two more PCIe x1 slots are available, for which I could use powered risers to add more hashing power.

If I were to build another mining rig, I would probably use GTX 780 cards instead, and run Windows to overclock them.
That gives the same or better performance at lower cost.

Very cool. That’s 15+ TFLOPS before overclocking!
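As a rough check of that 15+ TFLOPS figure, here is a small sketch of the usual peak single-precision estimate. The core count (2880) and base clock (875 MHz) are assumptions taken from the public GTX 780 Ti spec sheet, not from this thread, and the factor of 2 counts one fused multiply-add per core per clock.

```python
# Rough peak single-precision estimate for the 3x GTX 780 Ti rig.
# Assumed values (public spec sheet figures, not stated in this thread):
CUDA_CORES = 2880              # CUDA cores per GTX 780 Ti
BASE_CLOCK_HZ = 875e6          # base clock, before boost or overclocking
FLOPS_PER_CORE_PER_CLOCK = 2   # one fused multiply-add counts as 2 FLOPs
NUM_CARDS = 3                  # cards in the rig described above

per_card_tflops = CUDA_CORES * BASE_CLOCK_HZ * FLOPS_PER_CORE_PER_CLOCK / 1e12
rig_tflops = per_card_tflops * NUM_CARDS

print(f"per card: {per_card_tflops:.2f} TFLOPS, rig: {rig_tflops:.2f} TFLOPS")
```

This is a theoretical peak only. Scrypt hashing is integer and memory heavy, so the number says little about the achievable hash rate.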

I wish there was an overclocking option for Fermi and Kepler cards on Linux.
Currently I have to use a modded video BIOS (for some extra 40kHash/s per card).

“A new profiling feature in CUDA 5.5 allows you to profile the clocks, power, and thermal characteristics of the GPU as it executes your code. This feature is available in the NVIDIA Visual Profiler on Linux and 64-bit Windows 7/8 and NSight Eclipse Edition on Linux. Learn how to activate and use this feature by watching CUDACasts Episode 13.”

By the way, the latest cudaminer github version also does scrypt-jane hashing (for Yacoin, QQCoin) and it beats the current mining software for AMD GPUs by quite a margin.

By the way, I’ve received an unsolicited code submission from nVidia that boosts Compute 3.0 devices by ~13% and Compute 3.5 devices by 20%.

I am now mining 1.88 MHash/s on 3 GTX 780Ti cards, each one doing about 625-630 kHash/s. Sweet. With extra overclock they could do more, but Linux limits my overclocking options…
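For anyone comparing rigs, the numbers quoted in this thread can be cross-checked with a few lines of arithmetic. This sketch only uses figures from the posts above. Note that the 850 W wall reading was taken at 1.65 MHash/s, so the efficiency figure applies to that earlier state only.

```python
# Cross-check of the rig figures quoted in this thread.
NUM_CARDS = 3
PER_CARD_KHASH = (625, 630)   # reported per-card range with the new kernels

low, high = (NUM_CARDS * rate for rate in PER_CARD_KHASH)
print(f"expected rig total: {low}-{high} kHash/s")

# Earlier measurement: 1.65 MHash/s at 850 W from the wall.
OLD_RIG_KHASH = 1650
OLD_WALL_WATTS = 850
efficiency = OLD_RIG_KHASH / OLD_WALL_WATTS
print(f"efficiency at 1.65 MHash/s: {efficiency:.2f} kHash/s per watt")

# Relative gain of the rig total after the new kernels (1.65 -> 1.88 MHash/s).
NEW_RIG_KHASH = 1880
gain = NEW_RIG_KHASH / OLD_RIG_KHASH - 1
print(f"rig gain: {gain:.1%}")
```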

The code is available in the cudaminer github repo for anyone interested in scrypt or scrypt-jane cryptocoin mining. The respective optimized kernels have been named “Y” and “Z”… until I find a better naming system ;)

The nVidia engineer took my test_kernel code (which used __shfl() based transposition) and made it work much better. Seems I was on the right path when trying the shfl instruction, but I had stopped short of producing something useful.
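To make the idea of a shuffle-based transposition a bit more concrete, here is a toy model in plain Python (not CUDA, and not the actual cudaminer kernel). It assumes 4 "lanes" that each hold a row of 4 words and want to end up holding a column. The `shfl` helper is a hypothetical stand-in for the semantics of `__shfl()`: every lane offers one value and reads the value offered by a chosen source lane. The second function models the shared-memory fallback mentioned earlier for Fermi and older devices.

```python
def shfl(values, src_lanes):
    """Toy warp shuffle: lane i receives the value offered by lane src_lanes[i]."""
    return [values[src] for src in src_lanes]


def transpose_with_shfl(rows):
    """Transpose an n x n tile held one row per lane, using n shuffle rounds."""
    n = len(rows)
    out = [[None] * n for _ in range(n)]
    for k in range(n):
        # In round k, each lane offers the element its reader will need ...
        offered = [rows[lane][(lane - k) % n] for lane in range(n)]
        # ... and each lane reads from the lane k positions further along.
        src_lanes = [(lane + k) % n for lane in range(n)]
        received = shfl(offered, src_lanes)
        for lane in range(n):
            out[lane][(lane + k) % n] = received[lane]
    return out


def transpose_with_shared_memory(rows):
    """Same result via a staging buffer: all lanes write, then all lanes read."""
    n = len(rows)
    smem = [list(row) for row in rows]   # every lane stores its own row
    # On a real GPU a barrier would be needed here before reading.
    return [[smem[src][lane] for src in range(n)] for lane in range(n)]


rows = [[10 * lane + j for j in range(4)] for lane in range(4)]
print(transpose_with_shfl(rows))
print(transpose_with_shared_memory(rows))
```

Both functions produce the same transposed tile. The difference on real hardware is that the shuffle version exchanges registers directly, while the fallback pays for a round trip through shared memory.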

AMD cards are no longer significantly faster, just significantly cheaper. So could nVidia lower the price please? That’d be great… ;)


Awesome! It is all about the __shfl()…



For scrypt I am getting 470 kH/s.

On a Compute 3.0 device, pass the -l Y command line option; on Compute 3.5, pass -l Z.

The truth is I posted before reading all the posts. Right now I have the Cuda-master version from git, and when I use the -l Z flag on Ubuntu I get a segmentation fault. I will read the previous posts in a few days and check what I did wrong. I think I have the wrong cudaminer version.

Edit: I just realized the new version was submitted just a day ago, so I will download it and try the new kernels. I have a Titan and access to 660 Ti cards for testing.

Update: I downloaded the code from git and compiled it the usual way (./; ./configure; make), but when I run it I get this warning:

GPU #0: Given launch config ‘Z’ does not validate

Then it spends a long time autotuning, and next I get:

GPU #0: Performing auto-tuning (Patience…)
[2014-01-25 21:33:09] GPU #0: maximum total warps (BxW): 1484
[2014-01-25 21:40:29] GPU #0: 460.42 khash/s with configuration Z56x24

Strange: on the second run I got 518 kH/s with Z14x20.

I used flags -d 0 -i 0 -H 1 -l Z --no-stratum

This is more or less the same. I am doing something wrong, but I cannot figure out what.

With the 12 SMX on a GTX 780 you might want to use -l Z12x24 -i 0 -H 1,
on a Titan use -l Z14x24 -i 0 -H 1,
and on a 780Ti use -l Z15x24 -i 0 -H 1.

According to the author who submitted the code, #SMX x 24 is the best
launch configuration for this kernel. Autotune might not always find it.
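The three suggestions above (Z12x24, Z14x24, Z15x24) follow one pattern: kernel letter, then one block per SMX, then 24 warps per block. A small sketch of that pattern follows. The helper names are hypothetical, the SMX counts are the ones implied by the suggestions above, the 1484 limit is the "maximum total warps (BxW)" value from the autotune output quoted earlier, and the 4.0 MB per warp figure comes from the benchmark log further down.

```python
# SMX counts implied by the suggested configs Z12x24, Z14x24 and Z15x24.
SMX_COUNT = {"GTX 780": 12, "GTX TITAN": 14, "GTX 780 Ti": 15}

WARPS_PER_BLOCK = 24     # second number in the suggested configs
MB_PER_WARP = 4.0        # "32 hashes / 4.0 MB per warp" in the benchmark log
MAX_TOTAL_WARPS = 1484   # limit reported by autotune in the post above


def launch_config(kernel, blocks, warps_per_block=WARPS_PER_BLOCK):
    """Build the string passed to -l, e.g. 'Z14x24' (hypothetical helper)."""
    return f"{kernel}{blocks}x{warps_per_block}"


def total_warps(blocks, warps_per_block=WARPS_PER_BLOCK):
    """Blocks times warps per block, the 'BxW' product from the log."""
    return blocks * warps_per_block


for card, smx in SMX_COUNT.items():
    warps = total_warps(smx)
    print(f"{card}: -l {launch_config('Z', smx)} "
          f"({warps} warps, about {warps * MB_PER_WARP:.0f} MB of scratchpad)")

# The autotuned Z56x24 uses four times as many warps as Z14x24,
# yet still fits under the reported limit.
print(total_warps(56), "<=", MAX_TOTAL_WARPS)
```

Seen this way, autotune's Z56x24 is simply four blocks per SMX on a Titan, which uses far more memory without being faster in the runs reported here.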

-H 2 would lower your CPU use and offload SHA256 hashing to the GPU as well.
Not sure about the effect on hashing speed.


./cudaminer -d 0 --benchmark -l Z14x24 -i 0 -H 1
*** CudaMiner for nVidia GPUs by Christian Buchner ***
This is version 2014-01-20 (beta)
based on pooler-cpuminer 2.3.2 (c) 2010 Jeff Garzik, 2012 pooler
Cuda additions Copyright 2013,2014 Christian Buchner
My donation address: LKS1WDKGED647msBQfLBHV3Ls8sveGncnm

[2014-01-26 01:12:26] 1 miner threads started, using ‘scrypt’ algorithm.
[2014-01-26 01:12:26] GPU #0: GeForce GTX TITAN with compute capability 3.5
[2014-01-26 01:12:26] GPU #0: interactive: 0, tex-cache: 0 , single-alloc: 0
[2014-01-26 01:12:26] GPU #0: 32 hashes / 4.0 MB per warp.
[2014-01-26 01:12:26] GPU #0: using launch configuration Z14x24
[2014-01-26 01:12:26] GPU #0: GeForce GTX TITAN, 239.14 khash/s
[2014-01-26 01:12:26] Total: 239.14 khash/s
[2014-01-26 01:12:28] GPU #0: GeForce GTX TITAN, 604.62 khash/s
[2014-01-26 01:12:28] Total: 604.62 khash/s
[2014-01-26 01:12:33] GPU #0: GeForce GTX TITAN, 607.83 khash/s
[2014-01-26 01:12:33] Total: 607.83 khash/s
[2014-01-26 01:12:38] GPU #0: GeForce GTX TITAN, 608.31 khash/s
[2014-01-26 01:12:38] Total: 608.31 khash/s
[2014-01-26 01:12:43] GPU #0: GeForce GTX TITAN, 606.79 khash/s
[2014-01-26 01:12:43] Total: 606.79 khash/s
[2014-01-26 01:12:48] GPU #0: GeForce GTX TITAN, 605.39 khash/s
[2014-01-26 01:12:48] Total: 605.39 khash/s

On an i7 930 LGA1366 @ 3.8GHz, reference clocked GTX Titan, Linux Mint 16.

I did change “-g -O2” in the makefile to “-O3 -march=native”, but I’m not sure if that makes a difference.
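When comparing runs like these, it helps to average the steady-state readings rather than eyeball them. A minimal sketch, assuming the log format shown above ("Total: ... khash/s" lines) and treating the first reading as warm-up, which is my own choice and not something cudaminer prescribes:

```python
import re
from statistics import mean

# "Total" lines copied from the benchmark output above.
LOG = """\
[2014-01-26 01:12:26] Total: 239.14 khash/s
[2014-01-26 01:12:28] Total: 604.62 khash/s
[2014-01-26 01:12:33] Total: 607.83 khash/s
[2014-01-26 01:12:38] Total: 608.31 khash/s
[2014-01-26 01:12:43] Total: 606.79 khash/s
[2014-01-26 01:12:48] Total: 605.39 khash/s
"""

TOTAL_RE = re.compile(r"Total: ([0-9.]+) khash/s")


def total_rates(log_text):
    """Return every 'Total' hash rate found in the log, in kHash/s."""
    return [float(value) for value in TOTAL_RE.findall(log_text)]


rates = total_rates(LOG)
steady = rates[1:]   # drop the first reading, taken while the GPU ramps up
print(f"steady-state mean: {mean(steady):.2f} khash/s over {len(steady)} samples")
```

Running the same script over logs from different builds would show whether the compiler flag change has any measurable effect.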

[2014-01-26 09:27:24] GPU #0: using launch configuration Z14x24
[2014-01-26 09:27:24] GPU #0: GeForce GTX TITAN, 232.42 khash/s
[2014-01-26 09:27:24] Total: 232.42 khash/s
[2014-01-26 09:27:27] GPU #0: GeForce GTX TITAN, 502.55 khash/s