How is “significant” defined in this context? The rotate operations in SHA256 benefit from the funnel shifter, which allows each rotate to map to a single SHF instruction, versus two instructions on sm_2x and sm_30 (SHR + ISCADD). In an ad-hoc experiment, the reduction in dynamic instruction count is about 20%.
Just to echo @njuffa’s comments. Here are the SASS instruction counts by architecture for a basic SHA-256 kernel:
… and for other architectures:
I produced these counts by compiling a cubin, dumping the SASS and dividing the last instruction address by 8.
If you dump the unique instructions, the only differences between sm_30 and sm_35 show up in the shift and add instructions. Here are the instruction counts for each arch:
SM_30:
1 BRA
517 IADD
76 IADD32I
570 ISCADD
246 LOP.AND
630 LOP.XOR
5 LOP32I.AND
1 LOP32I.OR
1 LOP32I.XOR
17 MOV
1 MOV32I
4 NOP
666 SHR.U32
8 ST
-and-
SM_35:
1 BRA
517 IADD
76 IADD32I
246 LOP.AND
630 LOP.XOR
5 LOP32I.AND
1 LOP32I.OR
1 LOP32I.XOR
17 MOV
1 MOV32I
570 SHF.L.W
96 SHF.R
8 ST
The big difference is that sm_30 executes an extra 570 ISCADD ops.
There are also 666 SHR.U32 ops (ominous!) in sm_30, which are matched on sm_35 by 570 SHF.L.W + 96 SHF.R ops. Note that 570 + 96 = 666.
This regexp will let you see the unique instructions:
cuobjdump -sass <cubin> | grep -o --perl-regexp "\t[A-Z32.]+\s" | sort | uniq -c
Your (static) instruction counts suggest that you fully unrolled all loops. That may not result in the best performance. In my quick experiment I saw best performance when unrolling the main loop by a factor of 8. I assume (but haven’t checked) that this is due to the fully unrolled version exceeding the ICache size. As for overall speedup, I observed a factor of about 2.5x between C2050 and K20c. I mention this for illustration purposes only since I simply grabbed the first SHA256 code I could find and compiled it with the latest internal compiler, so “your mileage may vary”.
For example, there are various ways of re-writing the Boolean logic in the main loop; some of these variants may have better ILP than others, and the compiler may or may not be able to transform the code into the most advantageous one automatically. So it is best to experiment and manually select the fastest variant. Kepler is the first GPU architecture where ILP plays a role in performance optimization, although in my experience so far this has been a second-order effect (single-digit percentages).
You are correct! It’s a fully unrolled implementation using C X-Macros. I don’t use it for anything so I don’t know how fast it is. Happy to post the code if anyone wants it.
Oh, and only “chunk 0” bears the mark of the beast because the compiler is able to remove one step made up of constants. All following chunks perform 672 (666+6) rot+shr ops.
I’m also interested in this; I don’t understand why we should be slower. Is anyone working on it? Hello
Is nobody interested in solving this problem?
Hey, I published a litecoin miner for NVIDIA GPUs. It’s called cudaminer, and on average it’s 50% faster than any OpenCL implementation so far. I have it posted in the Alternative Cryptocurrency subforum on the bitcointalk forums.
The main difficulty is the scrypt hashing algorithm, which is memory-hard, i.e. it requires huge lookup tables and has random access patterns. So the best speed-ups I got so far came from optimizing the global memory access patterns. The other speed-ups I achieved came from increasing occupancy: I had to cut my shared memory use so that the SMXs on Kepler devices are nicely occupied.
Check it out, my program comes with source code.
I even have some Titan funnel shifter support in it, although I was getting reports that it is broken in the April 10th release. Some performance figures I got from a Tesla K20c card previously weren’t so convincing, i.e. the funnel shifter wasn’t really cutting it because we also had the memory wall as a limiting factor.
The cards in the GTX 5xx series (a 2nd-gen re-spin of the Fermi architecture) are apparently best for litecoin mining. Kepler devices lag about 20% behind that benchmark, performance-wise. And we’re still lagging behind the ATI cards, sigh, even though my work narrows the margin a little.
Christian
“32 bit integer right rotation, which NVIDIA GPUs do not have”
I found out that CC 3.5 devices now have a funnel shift (SHF) that lets you do a bit rotation in one instruction. I’m wondering how much that will help speed things up.
Can someone with a GK110 recompile a CUDA miner, and see if NVCC is smart enough to use the new instruction and what the speedup is?
For some idea about the benefit of the funnel shift instructions (SHF) added in sm_35, see previous messages in this thread.
Sorry, I was reading only the 1st page and didn’t see the thread was already 4 pages long.
I see the improvement is 20% fewer instructions (according to allanmac), but it seems AMD’s large number of arithmetic units still makes it much faster.
What I do not understand is how ATI cards can also beat us at litecoin mining, as the memory wall will hit both architectures in much the same way. (See: Random-access memory - Wikipedia)
I have some performance figures for Geforce Titan: 290 kHash/s vs e.g. a Radeon HD 7950 at around 600 kHash/s. This puzzles me.
Note that in litecoin mining we have kHash (not MHash) mostly because the scrypt hashing requires so much memory bandwidth.
Christian
Christian,
Been trying to get in touch with you, actually. Does a Sandy Bridge-E system (quad-channel memory bandwidth) add a significant performance increase to hashing compared to a conventional IVB dual-channel system?
Is it possible that the performance difference in LC is due to difference in efficiency for GPU memory access between AMD and NV?
Follow-up question to this, as far as improving NV’s BTC/LTC mining performance. Much of the discussion around the ’net has focused on CUDA implementations, but according to NV docs, Titan can execute 64 int32 operations per clock per SMX.
What do we know about GCN’s int32 execution rate?
Since there are so many versions around, can you provide a direct link to the download? I have a Gainward GTX 560 Golden Sample 1GB. Thanks
rpcminer-cuda is not working. How do I use it on Windows 7 x64?
OK, rpcminer-mod-cuda works now, but the results are mixed.
As I said, I’m trying rpcminer-mod-cuda with my GTX 560 (Gainward Golden Sample) at 900 MHz, getting 80,000 kH/s, but we are still far behind a Radeon of the same level. Do you think you can solve this problem, or do I have to go to the competition? Hello
I only try to solve the scrypt hashing problem with CUDA. I am not going to look at SHA-256. Maybe someone else will.
Here’s the thread that hosts my application: [ANN] cudaMiner & ccMiner CUDA based mining applications [Windows/Linux/MacOSX]
and a link to its github repo: GitHub - cbuchner1/CudaMiner: a CUDA accelerated litecoin mining application based on pooler's CPU miner
OK. What I want to understand is whether this is a software problem that NVIDIA should solve, or a hardware shortcoming? Hello
The answers are already in the thread, specifically:
AMD Radeon 3x faster on bitcoin mining SHA-256 hashing performance - CUDA Programming and Performance - NVIDIA Developer Forums
Or in other words, if you want fast bitcoin mining, NVIDIA is not the answer. Go look at ATI or ASICs…