AMD Radeon 3x faster on bitcoin mining SHA-256 hashing performance

RogerDahl · June 15, 2011, 8:54pm

Bitcoin mining is essentially SHA-256 hashing.

According to the table at http://bitminer.info/, the Radeon 6970 ($330) is able to run bitcoin mining at 323 M-hash/s while the GTX 570 ($330) runs it at 105 M-hash/s. The Radeon is 3x faster.

An explanation for this is provided at https://en.bitcoin.i…an_Nvidia_GPUs?

The explanation states that the Radeon is faster not only on SHA-256 hashing, but on “all ALU-bound GPGPU workloads”. Further, it explains that Radeon has a particular advantage on SHA-256 hashing because it has an instruction for 32 bit integer right rotation, which NVIDIA GPUs do not have.

I’m wondering what the CUDA community’s take on this is. Is the Radeon really faster on “all ALU-bound GPGPU workloads” at a given price point? If so, what is NVIDIA faster on?

hamster143 · June 15, 2011, 10:29pm

I would say that Radeon is asymptotically faster on ALU-bound workloads in the limit of infinite programmer time. So if you have a particularly simple, vectorizable task where the AMD compiler can do a good job, you might see a 2-3x advantage right away. If you have a complicated and poorly vectorizable task, making the program faster on AMD may take substantial effort.

NVIDIA is faster on non-vectorized tasks, especially if they involve memory accesses, and especially if those accesses are narrower than 4 bytes. For example, performing an operation with a 1-byte operand in L1 cache has no overhead on NVIDIA (as far as I know). To do the same on AMD, the compiler has to generate a complicated explicit sequence of instructions, effectively making that single access take as long as 20-30 ALU instructions.

Jimmy_Pettersson · June 16, 2011, 2:02pm

Yes, I think the theoretical difference is somewhere around ~2.7 TFLOPS vs ~1.5 TFLOPS on AMD and NVIDIA respectively, so if you have an extremely compute-bound problem you might approach this limit. The bandwidth difference is more or less negligible, and bandwidth is usually the limiting factor. But as hamster points out, the AMD 4-VLIW arch is often harder to utilize efficiently; their ALUs can be considered less general purpose than the NVIDIA FPUs.

NeoVidio · July 9, 2011, 10:33pm

Is it hence possible to make a faster SHA engine using only the ALUs?

Sarnath · July 10, 2011, 3:16pm

Most HPC-related problems are data-parallel and hence vectorizable. However, GPGPU is moving non-traditional HPC problems to the GPU. So, in those cases, achieving performance with AMD is a bit challenging. It totally depends on the problem in question…

As far as memory bandwidth goes, AMD can manage well even if there is a lot of non-coalesced access in your program.

It ain’t too bad… AMD cards can give as much bang for the buck as NVIDIA does… And OpenCL is a standard anyway…

But OpenCL does not really mitigate the portability issues. Separate kernels are sometimes needed to address the AMD and NVIDIA platforms separately…

You may want to check Dr. Dongarra’s paper on writing high performance BLAS kernels in OpenCL. Google for it. You should be able to find it… Dr. Dongarra is a pioneer in the BLAS and LAPACK world… He still is. One of the best in the field.

seibert · July 10, 2011, 5:21pm

Hopefully in another GPU generation or two, we’ll see some architectural convergence that will make OpenCL work better across platforms. It already looks like AMD is moving in the direction of NVIDIA for their next architecture:

http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute

Sarnath · July 11, 2011, 5:15am

Thanks for the link. Waiting to see how this whole new fusion thing will pan out…

Sarnath · July 11, 2011, 7:04am

AMD5970 - 530 MHash

Prev Generation card producing such stunning numbers? (suspect a typo)

hamster143 · July 12, 2011, 12:19am

No typo. 5xxx and 6xxx are on the same fabrication process (40 nm), and the design differences are minor. In fact, in terms of hashes per watt, the 5970 may be more efficient than the 6990. The table on bitminer.info shows the 5970 at 530 MHash and 294 W and the 6990 at 670 MHash and 346 W, but all news sources say that the 6990 really eats 375 W at full load.
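
The efficiency comparison is simple arithmetic. Here is a quick sketch that uses only the figures quoted in this post; the dictionary keys and the function name are just labels chosen for illustration:

```python
# Hash-rate (MHash/s) and power (W) figures exactly as quoted in the post.
cards = {
    "5970 (table)": (530.0, 294.0),
    "6990 (table)": (670.0, 346.0),
    "6990 (reported 375 W)": (670.0, 375.0),
}

def mhash_per_watt(mhash, watts):
    """Hashing efficiency in MHash/s per watt."""
    return mhash / watts

efficiency = {name: mhash_per_watt(*figures) for name, figures in cards.items()}

for name, value in efficiency.items():
    print(f"{name}: {value:.2f} MHash/s per watt")
```

With the table's 346 W the 6990 comes out ahead (about 1.94 vs 1.80 MHash/s per watt), but at 375 W it drops to about 1.79, slightly behind the 5970.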

Jimmy_Pettersson · July 12, 2011, 6:57am

Wait, isn’t the 6970 a dual GPU? The 570 is a single GPU.

A fair comparison would be the GTX 590 vs the Radeon 6970, wouldn’t it?

avidday · July 12, 2011, 7:15am

No, that is the 6990.

oasis789 · July 14, 2011, 6:22pm

This is unfortunate, since I am very interested in bitcoin mining AND I’d like to keep being an Nvidia guy, since my rig is intended for gaming first.

It’s great that Nvidia cards are better at folding@home :)

NeoVidio · July 15, 2011, 6:09am

That’s true. But I still think there is another way of improving the SHA engine with this class of GPU.

spadflyer12 · July 15, 2011, 5:29pm

I was actually looking into GPU bitcoin mining and I noticed that all of the GPU mining programs use OpenCL. It could very well be that the current miners are more optimized for the ATI cards. It is also entirely possible that even then the code is poorly optimized for the ATI cards, but any optimization one way will make one card better than the other.

There was a paper that someone posted a while back that studied the performance differences between OpenCL and CUDA, and compared performance between OpenCL routines optimized for ATI vs NVIDIA cards. Basically, any routine optimized to perform well on ATI cards performs poorly on NVIDIA cards. However, the same algorithm can be optimized for NVIDIA cards, in which case it performs poorly on ATI cards.

So what I’m trying to get at is that someone needs to write a well optimized miner in CUDA, in which case you will probably see equivalent, and possibly better, performance.

RogerDahl · July 15, 2011, 7:14pm

There is a CUDA miner available here:

http://forum.bitcoin.org/?topic=2444.0

Here’s the kernel from it:

http://pastebin.com/pRWTsLPT

However, I don’t know if that is the implementation that was used for the comparison I referenced in the initial post.

It is apparent that the implementation of rotateright() is important for the overall performance. I’m wondering if Fermi has gained a suitable instruction for it?

An idea for speeding up the algorithm is to rewrite it in PTX. And since it’s so repetitive, create a script that generates the PTX. The PTX can then be inserted into the kernel and compiled with the new inline PTX capability in CUDA 4. I plan on trying this out in a few weeks.
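
As a rough idea of what such a generator script could look like, here is a minimal sketch that emits the plain shift/shift/OR form of a constant rotate as PTX text. The function name and the register names (`%t0`, `%t1`, `%r0`, `%r1`) are made up for illustration, the output has not been run through the PTX assembler, and with inline PTX the registers would be operand placeholders instead:

```python
def ptx_ror(dst, src, n, tmp_a="%t0", tmp_b="%t1"):
    """Return PTX lines that rotate the 32-bit register `src` right by the
    compile-time constant `n`, leaving the result in `dst`.

    Register names are hypothetical; a real kernel would declare them
    with a .reg directive or bind them via inline-asm operands.
    """
    if not 0 < n < 32:
        raise ValueError("rotation amount must be between 1 and 31")
    return [
        f"shr.u32 {tmp_a}, {src}, {n};",       # low part: logical shift right
        f"shl.b32 {tmp_b}, {src}, {32 - n};",  # high part: bits that wrap around
        f"or.b32 {dst}, {tmp_a}, {tmp_b};",    # combine the two halves
    ]

# Emit the rotate used for ROR 7 as an example.
for line in ptx_ror("%r1", "%r0", 7):
    print(line)
```

Because the script sees every rotation amount as a constant, it is also the natural place to special-case any amounts for which a cheaper instruction sequence turns up.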

RogerDahl · July 15, 2011, 8:05pm

Unfortunately, it appears that there’s no ROR instruction in Fermi.

The Fermi instruction set is listed in the cuobjdump.pdf file that comes with CUDA 4. The PDF did not seem to be readily available online, so I’ve made it available here:

http://www.dahlsys.c…d/cuobjdump.pdf

For reference, here’s an assembly implementation of SHA-256:

http://read.pudn.com…ha256.asm__.htm

RogerDahl · July 17, 2011, 1:02am

The number of bit positions to rotate in each ROR (Rotate Right) is known at compile time. Ten rotation amounts are needed:

ROR 2, 6, 7, 11, 13, 17, 18, 19, 22, 25

By default, each of these is handled by 3 instructions, LSL + LSR + OR. The grand challenge is to come up with single instructions or combinations of two instructions from the Fermi instruction set that accomplish these.
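
For reference, the default three-operation form and the origin of the ten amounts can be sketched as follows. This is plain Python standing in for the GPU code, not the miner's kernel; the function names are illustrative, and the rotation amounts are those of the four standard SHA-256 sigma functions:

```python
MASK32 = 0xFFFFFFFF

def ror32(x, n):
    """Rotate a 32-bit word right by n bits (0 < n < 32): two shifts and an OR."""
    return ((x >> n) | (x << (32 - n))) & MASK32

# The four SHA-256 mixing functions; each uses constant rotation amounts.
def big_sigma0(x):
    return ror32(x, 2) ^ ror32(x, 13) ^ ror32(x, 22)

def big_sigma1(x):
    return ror32(x, 6) ^ ror32(x, 11) ^ ror32(x, 25)

def small_sigma0(x):
    return ror32(x, 7) ^ ror32(x, 18) ^ (x >> 3)

def small_sigma1(x):
    return ror32(x, 17) ^ ror32(x, 19) ^ (x >> 10)

# Collecting the amounts reproduces the list of ten given above.
ROTATION_AMOUNTS = sorted({2, 13, 22, 6, 11, 25, 7, 18, 17, 19})
print(ROTATION_AMOUNTS)
```

Each sigma function needs two or three rotates, so without a native rotate those shift/shift/OR triples account for a large share of the ALU work per round.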

For instance, it looks like the PRMT (Permute bytes from register pair) instruction (details on Page 79 in the PTX 2.1 ISA) can be used for ROR 8, 16 and 24. Unfortunately those are not among the 10 that are required. But maybe there’s an instruction that can do a ROR 1? If so, the two could be combined to create the required ROR 17 and 25.
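
To make the byte-permute idea concrete, here is a small sketch in plain Python. `byte_rotate_right` is a hypothetical stand-in that only reorders the four bytes of a word, which is the part a byte-permute instruction could do in one step; it does not model PRMT's actual selector encoding. The ROR 1 step is still done with shifts and an OR here, since whether a single instruction exists for it is exactly the open question:

```python
MASK32 = 0xFFFFFFFF

def ror32(x, n):
    """Reference rotate right by n bits (0 < n < 32) using shifts and an OR."""
    return ((x >> n) | (x << (32 - n))) & MASK32

def byte_rotate_right(x, nbytes):
    """Rotate right by whole bytes by reordering the four bytes of x.

    Hypothetical helper emulating what a byte-permute could achieve.
    """
    b = [(x >> (8 * i)) & 0xFF for i in range(4)]   # b[0] is the lowest byte
    out = [b[(i + nbytes) % 4] for i in range(4)]   # each byte moves down nbytes slots
    return sum(v << (8 * i) for i, v in enumerate(out))

def ror17_composed(x):
    """ROR 17 as a byte rotate by 2 bytes (ROR 16) followed by ROR 1."""
    return ror32(byte_rotate_right(x, 2), 1)

def ror25_composed(x):
    """ROR 25 as a byte rotate by 3 bytes (ROR 24) followed by ROR 1."""
    return ror32(byte_rotate_right(x, 3), 1)

x = 0x12345678
print(hex(ror17_composed(x)), hex(ror32(x, 17)))
```

The composition works because rotations add: rotating by 16 and then by 1 is the same as rotating by 17. It only pays off if the ROR 1 step costs fewer than two instructions.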

Though doubtful, maybe the Fermi cache hierarchy and some small tables can help as well.
