AMD Radeon 3x faster on bitcoin mining SHA-256 hashing performance

Just to echo @njuffa’s comments. Here are the SASS instruction counts by architecture for a basic SHA-256 kernel:

  • sm_35: 2550
  • sm_30: 3202

… and for other architectures:

  • sm_21: 2802
  • sm_12: 3004
  • sm_11: 3016

    I produced these counts by compiling a cubin, dumping the SASS and dividing the last instruction address by 8.
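The counting method can be sketched in Python. This is a minimal sketch, assuming the dump puts each instruction's byte address in a `/*xxxx*/` comment at the start of the line and that every instruction slot is 8 bytes. The `SAMPLE` text and the name `count_from_last_address` are made up for illustration and are not from the real kernel.

```python
import re

# Leading /*hhhh*/ comment holding the instruction's byte address (assumed layout).
ADDRESS = re.compile(r"^\s*/\*([0-9a-fA-F]+)\*/")

def count_from_last_address(sass_text, bytes_per_instruction=8):
    """Return the highest instruction address divided by the instruction size."""
    addresses = [
        int(m.group(1), 16)
        for line in sass_text.splitlines()
        if (m := ADDRESS.match(line))
    ]
    if not addresses:
        return 0
    return max(addresses) // bytes_per_instruction

# Hypothetical four-instruction dump; operands are invented.
SAMPLE = """\
        /*0000*/    MOV R1, c[0x0][0x44];
        /*0008*/    IADD R0, R0, R1;
        /*0010*/    LOP.XOR R2, R0, R1;
        /*0018*/    ST [R4], R2;
"""

print(count_from_last_address(SAMPLE))
```

Because addresses start at zero, the quotient is the index of the last instruction, one less than the number of instruction lines. On a kernel with thousands of instructions that offset does not change the comparison between architectures.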

    If you dump the unique instructions, the only differences between sm_30 and sm_35 (NOPs aside) are:

  • sm_30: SHR.U32, ISCADD
  • sm_35: SHF.L.W, SHF.R

    And here are the instruction counts for each arch:

    SM_30:
          1         BRA
        517         IADD
         76         IADD32I
        570         ISCADD
        246         LOP.AND
        630         LOP.XOR
          5         LOP32I.AND
          1         LOP32I.OR
          1         LOP32I.XOR
         17         MOV
          1         MOV32I
          4         NOP
        666         SHR.U32
          8         ST
    

    -and-

    SM_35:
          1         BRA
        517         IADD
         76         IADD32I
        246         LOP.AND
        630         LOP.XOR
          5         LOP32I.AND
          1         LOP32I.OR
          1         LOP32I.XOR
         17         MOV
          1         MOV32I
        570         SHF.L.W
         96         SHF.R
          8         ST
    

    The big difference is that sm_30 executes an extra 570 ISCADD ops.
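To make the comparison mechanical, the two listings can be diffed in a few lines of Python. The counts below are copied from the SM_30 and SM_35 listings above; the helper name `histogram_diff` is only illustrative.

```python
from collections import Counter

# Per-opcode counts copied from the two listings above.
SM_30 = Counter({
    "BRA": 1, "IADD": 517, "IADD32I": 76, "ISCADD": 570, "LOP.AND": 246,
    "LOP.XOR": 630, "LOP32I.AND": 5, "LOP32I.OR": 1, "LOP32I.XOR": 1,
    "MOV": 17, "MOV32I": 1, "NOP": 4, "SHR.U32": 666, "ST": 8,
})
SM_35 = Counter({
    "BRA": 1, "IADD": 517, "IADD32I": 76, "LOP.AND": 246, "LOP.XOR": 630,
    "LOP32I.AND": 5, "LOP32I.OR": 1, "LOP32I.XOR": 1, "MOV": 17,
    "MOV32I": 1, "SHF.L.W": 570, "SHF.R": 96, "ST": 8,
})

def histogram_diff(a, b):
    """Map each opcode whose count differs to (count_in_a, count_in_b)."""
    return {op: (a[op], b[op]) for op in sorted(set(a) | set(b)) if a[op] != b[op]}

for op, (n30, n35) in histogram_diff(SM_30, SM_35).items():
    print(f"{op:10} sm_30={n30:4} sm_35={n35:4}")

# Net number of extra instructions in the sm_30 listing.
print("extra in sm_30:", sum(SM_30.values()) - sum(SM_35.values()))
```

Every opcode other than ISCADD, NOP, SHR.U32, SHF.L.W and SHF.R has the same count in both listings, and the net surplus on sm_30 is the 570 ISCADDs plus the 4 NOPs.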

    There are also 666 SHR.U32 ops (ominous!) in sm_30, which are matched by 570 SHF.L.W + 96 SHF.R ops in sm_35. Note that 570 + 96 = 666.
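A plausible reading of these numbers, not something the dumps prove by themselves: SHA-256 is built on 32-bit rotates, and a rotate can be formed from a right shift plus a shift-left-and-add, which would fit an SHR.U32 and ISCADD pair on sm_30, whereas a funnel shift does it in one step, which would fit a single SHF.L.W on sm_35. The Python sketch below only models the arithmetic of both forms; `rotr_shift_add` and `rotr_funnel` are illustrative names and are not a description of exact SASS semantics.

```python
MASK32 = 0xFFFFFFFF

def rotr_shift_add(x, n):
    """Rotate right (0 < n < 32) from a logical right shift plus a shift-left-and-add."""
    low = x >> n                      # the right-shift half
    high = (x << (32 - n)) & MASK32   # the left-shifted operand of the add
    return (high + low) & MASK32      # halves share no bits, so add equals or

def rotr_funnel(x, n):
    """Rotate right as a funnel shift: shift the 64-bit pair x:x and keep the low word."""
    return (((x << 32) | x) >> n) & MASK32

x = 0x6A09E667  # first SHA-256 initial hash value, used here only as test input
for n in (2, 13, 22):  # rotation amounts of SHA-256's big-sigma-0 function
    assert rotr_shift_add(x, n) == rotr_funnel(x, n)
print(hex(rotr_funnel(x, 2)))
```

Under that reading, the 570 SHF.L.W ops would be the rotates and the 96 SHF.R ops the plain right shifts, which lines up with the two plain shifts per expanded word in the 48-word SHA-256 message schedule.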

    This regexp will let you see the unique instructions:

    cuobjdump -sass <cubin> | grep -o --perl-regexp "\t[A-Z32\.]+\s" | sort | uniq -c
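If you would rather post-process a saved dump than chain shell tools, roughly the same tally can be done in Python. This is a sketch under assumptions: it expects each instruction line to start with a `/*address*/` comment followed by an optional predicate and then the opcode. The `SAMPLE` lines and the name `opcode_histogram` are hypothetical.

```python
import re
from collections import Counter

# First token after the /*address*/ comment, skipping an optional predicate such as @P0.
OPCODE = re.compile(r"^\s*/\*[0-9a-fA-F]+\*/\s+(?:@!?\w+\s+)?([A-Z][A-Z0-9_.]*)")

def opcode_histogram(sass_text):
    """Count how many times each opcode appears in a SASS dump."""
    counts = Counter()
    for line in sass_text.splitlines():
        m = OPCODE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Hypothetical dump lines; operands are invented.
SAMPLE = """\
        /*0000*/    MOV R1, c[0x0][0x44];
        /*0008*/    SHR.U32 R2, R0, 0x7;
        /*0010*/    ISCADD R2, R0, R2, 0x19;
        /*0018*/    SHR.U32 R3, R0, 0x12;
        /*0020*/    LOP.XOR R2, R2, R3;
        /*0028*/    @P0 BRA 0x40;
        /*0030*/    EXIT;
"""

# Print in the same "count opcode" layout as sort | uniq -c.
for op, n in sorted(opcode_histogram(SAMPLE).items()):
    print(f"{n:7} {op}")
```

Unlike the grep pattern, this version also counts opcodes that are followed directly by a semicolon, such as EXIT, so its totals can come out slightly higher.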