How is “significant” defined in this context? The rotate operations in SHA256 benefit from the funnel shifter, which allows each rotate to map to a single SHF instruction, versus two instructions on sm_2x and sm_30 (SHR + ISCADD). In an ad-hoc experiment, the reduction in dynamic instruction count is about 20%.
Just to echo @njuffa’s comments. Here are the SASS instruction counts by architecture for a basic SHA-256 kernel:
… and for other architectures:
I produced these counts by compiling a cubin, dumping the SASS and dividing the last instruction address by 8.
If you dump the unique instructions, the only differences between sm_30 and sm_35 show up in the shift and add instructions. Here are the instruction counts for each arch:
SM_30:
1 BRA
517 IADD
76 IADD32I
570 ISCADD
246 LOP.AND
630 LOP.XOR
5 LOP32I.AND
1 LOP32I.OR
1 LOP32I.XOR
17 MOV
1 MOV32I
4 NOP
666 SHR.U32
8 ST
-and-
SM_35:
1 BRA
517 IADD
76 IADD32I
246 LOP.AND
630 LOP.XOR
5 LOP32I.AND
1 LOP32I.OR
1 LOP32I.XOR
17 MOV
1 MOV32I
570 SHF.L.W
96 SHF.R
8 ST
The big difference is that sm_30 executes an extra 570 ISCADD ops.
There are also 666 SHR.U32 ops (ominous!) in sm_30, which are matched on sm_35 by 570 SHF.L.W + 96 SHF.R ops. Note that 570 + 96 = 666.
This regexp will let you see the unique instructions:
cuobjdump -sass <cubin> | grep -o --perl-regexp "\t[A-Z32.]+\s" | sort | uniq -c
Your (static) instruction counts suggest that you fully unrolled all loops. That may not result in the best performance. In my quick experiment I saw best performance when unrolling the main loop by a factor of 8. I assume (but haven’t checked) that this is due to the fully unrolled version exceeding the ICache size. As for overall speedup, I observed a factor of about 2.5x between C2050 and K20c. I mention this for illustration purposes only since I simply grabbed the first SHA256 code I could find and compiled it with the latest internal compiler, so “your mileage may vary”.
For example, there are various ways of re-writing the Boolean logic in the main loop; some of these variants may have better ILP than others, and the compiler may or may not be able to transform the code into the most advantageous one automatically. So it is best to experiment and manually select the fastest variant. Kepler is the first GPU architecture where ILP plays a role in performance optimization, although in my experience so far this has been a second-order effect (single-digit percentages).
You are correct! It’s a fully unrolled implementation using C X-Macros. I don’t use it for anything so I don’t know how fast it is. Happy to post the code if anyone wants it.
Oh, and only “chunk 0” bears the mark of the beast because the compiler is able to remove one step made up of constants. All following chunks perform 672 (666+6) rot+shr ops.
I’m also interested in this; I don’t understand why we should be slower. Is anyone working on it? Hello
Is nobody interested in solving this problem?
Hey, I published a litecoin miner for NVIDIA GPUs. It’s called cudaminer, and on average it’s 50% faster than any OpenCL implementation so far. I have it posted in the Alternative Cryptocurrency subforum on the bitcointalk forums.
The main difficulty is the scrypt hashing algorithm, which is memory-hard, i.e. it requires huge lookup tables and has random access patterns. So the best speed-ups I got so far came from optimizing the global memory access patterns. The other speed-ups I achieved came from increasing occupancy: I had to cut my shared memory use so that the SMXs on Kepler devices are nicely occupied.
Check it out, my program comes with source code.
I even have some Titan funnel shifter support in it, although I was getting reports that it is broken in the April 10th release. Some performance figures I got from a Tesla K20c card previously weren’t so convincing, i.e. the funnel shifter wasn’t really cutting it because we also had the memory wall as a limiting factor.
The cards in the GTX 5xx series (a 2nd-gen re-spin of the Fermi architecture) are apparently best for litecoin mining. Kepler devices lag about 20% behind that benchmark, performance-wise. And we’re still lagging behind the ATI cards, sigh, even though my work narrows the margin a little.
Christian
“32 bit integer right rotation, which NVIDIA GPUs do not have”
I found out that CC 3.5 devices now have a funnel shift (SHF) that lets you do a bit rotation in one instruction. I’m wondering how much that will help speed things up.
Can someone with a GK110 recompile a CUDA miner, and see if NVCC is smart enough to use the new instruction and what the speedup is?
For some idea about the benefit of the funnel shift instructions (SHF) added in sm_35, see previous messages in this thread.
Sorry, I was reading only the 1st page and didn’t see the thread was already 4 pages long.
I see the improvement is 20% fewer instructions (according to allanmac), but it seems AMD’s large number of arithmetic units still makes it much faster.
What I do not understand is how ATI cards can also beat us at litecoin mining, as the memory wall will hit both architectures in much the same way. (See: Random-access memory - Wikipedia)
I have some performance figures for Geforce Titan: 290 kHash/s vs e.g. a Radeon HD 7950 at around 600 kHash/s. This puzzles me.
Note that in litecoin mining we have kHash (not MHash) mostly because the scrypt hashing requires so much memory bandwidth.
Christian
Christian,
Been trying to get in touch with you, actually. Does a Sandy Bridge-E system (quad-channel memory bandwidth) add a significant performance increase to hashing compared to a conventional IVB dual-channel system?
Is it possible that the performance difference in LC is due to difference in efficiency for GPU memory access between AMD and NV?
Follow-up question to this, as far as improving NV’s BTC/LTC mining performance. Much of the discussion around the ’net has focused on CUDA implementations, but according to NV docs, Titan can execute 64 int32 operations per clock per SMX.
What do we know about GCN’s int32 execution rate?
Since there are so many versions around, can you provide a direct link to the download? I have a Gainward GTX 560 Golden Sample 1GB. Thanks
rpcminer-cuda is not working. How do I use it on Windows 7 x64?
OK, rpcminer-mod-cuda works now, but the results are mixed.
As I said, I’m trying rpcminer-mod-cuda with my GTX 560 (Gainward Golden Sample) at 900 MHz, getting 80,000 kH/s, but we are still far behind a Radeon of the same level. Do you think you can solve this problem, or do I have to go to the competition? Hello
I only try to solve the scrypt hashing problem with CUDA. I am not going to look at SHA-256. Maybe someone else will.
Here’s the thread that hosts my application: [ANN] cudaMiner & ccMiner CUDA based mining applications [Windows/Linux/MacOSX]
and a link to its github repo: GitHub - cbuchner1/CudaMiner: a CUDA accelerated litecoin mining application based on pooler's CPU miner
OK. What I want to understand is whether this is a software problem that NVIDIA should solve, or a hardware shortcoming? Hello
The answers are already in the thread, specifically:
AMD Radeon 3x faster on bitcoin mining SHA-256 hashing performance - CUDA Programming and Performance - NVIDIA Developer Forums
Or in other words, if you want fast bitcoin mining, NVIDIA is not the answer. Go look at ATI or ASICs…