AMD Radeon 3x faster on bitcoin mining SHA-256 hashing performance

Bitcoin mining is essentially SHA-256 hashing.

According to the table at http://bitminer.info/, the Radeon 6970 ($330) is able to run bitcoin mining at 323 M-hash/s while the GTX 570 ($330) runs it at 105 M-hash/s. The Radeon is 3x faster.

An explanation for this is provided at https://en.bitcoin.i…an_Nvidia_GPUs?

The explanation states that the Radeon is faster not only on SHA-256 hashing, but on “all ALU-bound GPGPU workloads”. Further, it explains that the Radeon has a particular advantage on SHA-256 hashing because it has an instruction for 32-bit integer right rotation, which NVIDIA GPUs do not have.

I’m wondering what the CUDA community’s take on this is. Is the Radeon really faster on “all ALU-bound GPGPU workloads” at a given price point? If so, what is NVIDIA faster on?

I would say that Radeon is asymptotically faster on ALU-bound workloads in the limit of infinite programmer time. So if you have a particularly simple, vectorizable task where the AMD compiler can do a good job, you might see a 2-3x advantage right away. If you have a complicated and poorly vectorizable task, making the program faster on AMD may take substantial effort.

NVIDIA is faster on non-vectorized tasks, especially if they involve memory accesses, and even more so if those accesses are narrower than 4 bytes. For example, performing an operation with a 1-byte operand in L1 cache has no overhead on NVIDIA (as far as I know). To do the same on AMD, the compiler has to generate a complicated explicit sequence of instructions, effectively making that single access take as long as 20-30 ALU instructions.
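
To make the sub-word access point concrete, here is a minimal C sketch of reading one byte when only aligned 32-bit loads are available. It is an illustration of the general shift-and-mask pattern, not the sequence the AMD compiler actually emits; the function name and the little-endian packing are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: reads byte `idx` from a buffer that can only be
   loaded as aligned 32-bit words. Assumes byte 0 of each word sits in the
   low 8 bits (little-endian packing). */
static uint8_t load_byte_via_word(const uint32_t *words, size_t idx)
{
    assert(words != NULL);
    uint32_t word  = words[idx / 4];             /* aligned 32-bit load    */
    unsigned shift = 8u * (unsigned)(idx % 4);   /* bit offset of the byte */
    return (uint8_t)((word >> shift) & 0xFFu);   /* shift down, then mask  */
}
```

A 1-byte store is worse still, since the surrounding word has to be read, patched and written back.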

Yes, I think the theoretical peak is somewhere around ~2.7 TFLOPS for AMD versus ~1.5 TFLOPS for NVIDIA, so if you have an extremely compute-bound problem you might approach this ratio. The memory bandwidth difference between the two is more or less negligible, and bandwidth is usually the limiting factor. But as hamster points out, the AMD 4-wide VLIW architecture is often harder to utilize efficiently; its ALUs are not as general purpose as the NVIDIA FPUs.

Is it hence possible to make a faster SHA engine using only the ALUs?

Most HPC-related problems are data-parallel and hence vectorizable. However, GPGPU is moving non-traditional HPC problems to the GPU. In those cases, achieving performance with AMD is a bit challenging; it depends entirely on the problem in question.

As far as memory bandwidth goes, AMD can manage well even if there are a lot of non-coalesced accesses in your program.

It isn’t too bad. AMD cards can give as much bang for the buck as NVIDIA does. And OpenCL is a standard anyway.

But OpenCL does not really mitigate the portability issues. Separate kernels are sometimes needed to address the AMD and NVIDIA platforms separately.
You may want to check Dr. Dongarra’s paper on writing high performance BLAS kernels in OpenCL. Google for it and you should be able to find it. Dr. Dongarra is a pioneer in the BLAS and LAPACK world, and is still one of the best in the field.

Hopefully in another GPU generation or two, we’ll see some architectural convergence that will make OpenCL work better across platforms. It already looks like AMD is moving in the direction of NVIDIA for their next architecture:

http://www.anandtech.com/show/4455/amds-graphics-core-next-preview-amd-architects-for-compute

Thanks for the link. Waiting to see how this whole new fusion thing will pan out…

AMD5970 - 530 MHash

A previous-generation card producing such stunning numbers? (I suspect a typo.)

No typo. The 5xxx and 6xxx series use the same fabrication process (40 nm), and the design differences are minor. In fact, in terms of hashes per watt, the 5970 may be more efficient than the 6990. The table on bitminer.info shows the 5970 at 530 MHash/s and 294 W and the 6990 at 670 MHash/s and 346 W, but all news sources say that the 6990 really draws 375 W at full load.
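
For anyone who wants to check the efficiency claim, the arithmetic is just throughput divided by power. A trivial C sketch (the helper name is made up for this post; the inputs are the figures quoted above):

```c
#include <assert.h>

/* Mining efficiency in MHash/s per watt. */
static double mhash_per_watt(double mhash_per_s, double watts)
{
    assert(watts > 0.0);
    return mhash_per_s / watts;
}
```

With the quoted numbers this gives roughly 1.80 MHash/s per watt for the 5970, 1.94 for the 6990 at the table's 346 W, and 1.79 for the 6990 at 375 W. So the 5970 only comes out ahead if the 375 W figure is the accurate one, and then only barely.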

Wait, isn’t the 6970 a dual-GPU card? The 570 is a single GPU.

A fair comparison would be the GTX 590 vs the Radeon 6970, wouldn’t it?

No, the dual-GPU card is the 6990.

This is unfortunate, since I am very interested in bitcoin mining AND I’d like to keep being an Nvidia guy, since my rig is intended for gaming first.

It’s great that Nvidia cards are better at folding@home :)

That’s true. But I still think there is another way of improving the SHA engine on this class of GPU.

I was actually looking into GPU bitcoin mining and I noticed that all of the GPU mining programs use OpenCL. It could very well be that the current miners are more optimized for the ATI cards. It is also entirely possible that even then the code is poorly optimized for the ATI cards, but any optimization in one direction will make one card look better than the other.

There was a paper that someone posted a while back that studied the performance differences between OpenCL and CUDA, and compared OpenCL routines optimized for ATI cards against ones optimized for Nvidia cards. Basically, any routine optimized to perform well on ATI cards performs poorly on Nvidia cards. However, the same algorithm can be optimized for Nvidia cards, in which case it performs poorly on ATI cards.

So what I’m trying to get at is that someone needs to write a well optimized miner in CUDA, in which case you will probably see equivalent, and possibly better, performance.

There is a CUDA miner available here:

http://forum.bitcoin.org/?topic=2444.0

Here’s the kernel from it:

http://pastebin.com/pRWTsLPT

However, I don’t know if that is the implementation that was used for the comparison I referenced in the initial post.

It is apparent that the implementation of rotateright() is important for the overall performance. I’m wondering if Fermi has gained a suitable instruction for it?

An idea for speeding up the algorithm is to rewrite it in PTX. And since it’s so repetitive, create a script that generates the PTX. The PTX can then be inserted into the kernel and compiled with the new inline PTX capability in CUDA 4. I plan on trying this out in a few weeks.
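
As a rough idea of what such a generator could look like, here is a small C helper that emits the PTX text for one constant rotate using the plain shift, shift, OR sequence. This is only a sketch: the helper name is invented, the register names are whatever the caller passes in, and the output would still need register declarations around it (or adapting to inline PTX operand syntax) before it could be assembled.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical generator helper: writes PTX text that rotates the 32-bit
   register `src` right by the compile-time constant n into `dst`, using
   scratch registers t0 and t1. Returns the snprintf character count. */
static int emit_ror_ptx(char *out, size_t cap, const char *dst,
                        const char *src, const char *t0, const char *t1,
                        unsigned n)
{
    assert(n > 0 && n < 32);
    return snprintf(out, cap,
                    "shr.u32 %s, %s, %u;\n"   /* t0  = src >> n        */
                    "shl.b32 %s, %s, %u;\n"   /* t1  = src << (32 - n) */
                    "or.b32  %s, %s, %s;\n",  /* dst = t0 | t1         */
                    t0, src, n, t1, src, 32u - n, dst, t0, t1);
}
```

A full generator would call something like this for every rotate in the unrolled rounds, which is also the natural place to swap in a cheaper instruction sequence for specific rotation amounts if one turns up.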

Unfortunately, it appears that there’s no ROR instruction in Fermi.

The Fermi instruction set is listed in the cuobjdump.pdf file that comes with CUDA 4. The PDF did not seem to be readily available online, so I’ve made it available here:

http://www.dahlsys.c…d/cuobjdump.pdf

For reference, here’s an assembly implementation of SHA-256:

http://read.pudn.com…ha256.asm__.htm

The number of bit positions to rotate in each ROR (Rotate Right) is known at compile time. There are 10 distinct rotation amounts needed:

ROR 2, 6, 7, 11, 13, 17, 18, 19, 22, 25

By default, each of these is handled by 3 instructions: LSL + LSR + OR. The grand challenge is to come up with single instructions, or combinations of two instructions, from the Fermi instruction set that accomplish these.
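
For reference, this is where the 10 rotation amounts come from. The sketch below is plain C rather than CUDA so it can be checked on any host; the rotate is written as the shift, shift, OR pattern, and the four mixing functions follow the SHA-256 definition in FIPS 180-4. The function names are my own.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word right by n bits (0 < n < 32) using the
   shift + shift + OR pattern. */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x >> n) | (x << (32u - n));
}

/* The four SHA-256 mixing functions. Between them they use the rotation
   amounts 2, 6, 7, 11, 13, 17, 18, 19, 22 and 25. */
static uint32_t big_sigma0(uint32_t x)
{
    return rotr32(x, 2) ^ rotr32(x, 13) ^ rotr32(x, 22);
}

static uint32_t big_sigma1(uint32_t x)
{
    return rotr32(x, 6) ^ rotr32(x, 11) ^ rotr32(x, 25);
}

static uint32_t small_sigma0(uint32_t x)
{
    return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3);
}

static uint32_t small_sigma1(uint32_t x)
{
    return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10);
}
```

Each compression round evaluates both big sigma functions, so that is six rotates per round. At 3 instructions per rotate versus 1 with a native rotate, it is easy to see why the rotate implementation matters so much.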

For instance, it looks like the PRMT (Permute bytes from register pair) instruction (details on Page 79 in the PTX 2.1 ISA) can be used for ROR 8, 16 and 24. Unfortunately those are not among the 10 that are required. But maybe there’s an instruction that can do a ROR 1? If so, the two could be combined to create the required ROR 17 and 25.
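
To sanity-check the composition idea, here is a host-side C emulation. The byte_permute function only mimics the effect of a byte permute on a single source register; it is not the PRMT instruction itself, and all names here are invented for the example. The 1-bit rotate is written as an ordinary shift, shift, OR, standing in for the hoped-for single instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Plain rotate right, used as the reference and as the stand-in for a
   hypothetical one-instruction ROR 1. */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x >> n) | (x << (32u - n));
}

/* Emulates a byte permute restricted to one source register: result byte i
   is source byte sel[i], with byte 0 the least significant. */
static uint32_t byte_permute(uint32_t x, const unsigned sel[4])
{
    uint32_t r = 0;
    for (unsigned i = 0; i < 4; ++i) {
        assert(sel[i] < 4);
        r |= ((x >> (8u * sel[i])) & 0xFFu) << (8u * i);
    }
    return r;
}

/* Rotating right by 8*k bits moves source byte (i + k) mod 4 into byte i. */
static uint32_t rotr_bytes(uint32_t x, unsigned k)
{
    const unsigned sel[4] = { k % 4, (1 + k) % 4, (2 + k) % 4, (3 + k) % 4 };
    return byte_permute(x, sel);
}

/* ROR 17 and ROR 25 composed from a byte rotate plus a 1-bit rotate. */
static uint32_t rotr17(uint32_t x) { return rotr32(rotr_bytes(x, 2), 1); }
static uint32_t rotr25(uint32_t x) { return rotr32(rotr_bytes(x, 3), 1); }
```

The composition is correct, but it only saves anything if the 1-bit rotate really costs one instruction. With a 3-instruction ROR 1 it ends up slower than the default sequence.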

Though doubtful, maybe the Fermi cache hierarchy and some small tables can help as well.