Just to echo @njuffa’s comments. Here are the SASS instruction counts by architecture for a basic SHA-256 kernel:
… and for other architectures:
I produced these counts by compiling a cubin, dumping the SASS, and dividing the address of the last instruction by 8 (the instruction size in bytes on these architectures).
If you dump the unique instructions the only differences between sm_30 and sm_35 are:
And here are the instruction counts for each arch:
SM_30:
1 BRA
517 IADD
76 IADD32I
570 ISCADD
246 LOP.AND
630 LOP.XOR
5 LOP32I.AND
1 LOP32I.OR
1 LOP32I.XOR
17 MOV
1 MOV32I
4 NOP
666 SHR.U32
8 ST
-and-
SM_35:
1 BRA
517 IADD
76 IADD32I
246 LOP.AND
630 LOP.XOR
5 LOP32I.AND
1 LOP32I.OR
1 LOP32I.XOR
17 MOV
1 MOV32I
570 SHF.L.W
96 SHF.R
8 ST
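The two listings above can also be compared mechanically. Here is a minimal Python sketch that copies the counts from the listings and prints only the opcodes whose counts differ; the helper name diff_counts is made up for illustration.

```python
from collections import Counter

# Counts copied verbatim from the SM_30 and SM_35 listings above.
sm_30 = Counter({
    "BRA": 1, "IADD": 517, "IADD32I": 76, "ISCADD": 570,
    "LOP.AND": 246, "LOP.XOR": 630, "LOP32I.AND": 5, "LOP32I.OR": 1,
    "LOP32I.XOR": 1, "MOV": 17, "MOV32I": 1, "NOP": 4,
    "SHR.U32": 666, "ST": 8,
})
sm_35 = Counter({
    "BRA": 1, "IADD": 517, "IADD32I": 76,
    "LOP.AND": 246, "LOP.XOR": 630, "LOP32I.AND": 5, "LOP32I.OR": 1,
    "LOP32I.XOR": 1, "MOV": 17, "MOV32I": 1,
    "SHF.L.W": 570, "SHF.R": 96, "ST": 8,
})


def diff_counts(a, b):
    """Return {opcode: (count_in_a, count_in_b)} for opcodes whose counts differ.

    A Counter returns 0 for a missing key, so an opcode that appears in
    only one listing shows up with a 0 on the other side.
    """
    return {op: (a[op], b[op]) for op in sorted(set(a) | set(b)) if a[op] != b[op]}


if __name__ == "__main__":
    for op, (n30, n35) in diff_counts(sm_30, sm_35).items():
        print(f"{op:10s} sm_30={n30:4d}  sm_35={n35:4d}")
    print("listed total sm_30:", sum(sm_30.values()))
    print("listed total sm_35:", sum(sm_35.values()))
```

Summing the listed counts gives 2743 for sm_30 and 2169 for sm_35, a gap of 574 (the 570 ISCADD ops plus the 4 NOPs).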
The big difference is that sm_30 executes an extra 570 ISCADD ops.
There are also 666 SHR.U32 ops (ominous!) in sm_30, which are matched in sm_35 by 570 SHF.L.W + 96 SHF.R ops. Note that 570 + 96 = 666.
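One plausible reading of these numbers (my interpretation, not something the dump itself proves): SHA-256 is full of 32-bit rotates, and sm_30 has no funnel-shift instruction, so each rotate there would cost a SHR.U32 plus an ISCADD (shift left and add), while sm_35 can do the same thing with a single SHF.L.W. That would account for 570 of the SHR.U32 ops being paired with the 570 ISCADD ops. The Python sketch below only models the arithmetic of the two forms; the function names are hypothetical and the mapping to specific SASS opcodes is an assumption.

```python
MASK32 = 0xFFFFFFFF


def rotl_shift_add(x, n):
    """Rotate a 32-bit value left by n (1..31) using two steps.

    Assumed to mirror the sm_30 pattern: one logical right shift,
    then a shift-left-and-add.
    """
    x &= MASK32
    hi = x >> (32 - n)                      # models SHR.U32
    # Models ISCADD: (x << n) + hi. The low n bits of (x << n) are zero
    # and hi < 2**n, so the add never carries and behaves like an OR.
    return (((x << n) & MASK32) + hi) & MASK32


def funnel_shift_left(lo, hi, n):
    """Return the upper 32 bits of the 64-bit value hi:lo shifted left by n mod 32."""
    n &= 31
    wide = ((hi & MASK32) << 32) | (lo & MASK32)
    return ((wide << n) >> 32) & MASK32


def rotl_funnel(x, n):
    """Rotate left in one step by funnel-shifting x:x (assumed to mirror SHF.L.W)."""
    return funnel_shift_left(x, x, n)


def rotr(x, n):
    """SHA-256 is specified with right rotates; rotr(x, n) == rotl(x, 32 - n)."""
    return rotl_funnel(x, 32 - n)
```

Both rotate forms give identical results for every shift amount from 1 to 31, so the compiler is free to pick whichever the target architecture supports.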
This regexp will let you see the unique instructions:
cuobjdump -sass <cubin> | grep -o --perl-regexp "\t[A-Z32.]+\s" | sort | uniq -c
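If grep with Perl regexes isn't available, roughly the same tally can be done in Python. This is only a sketch: it assumes each instruction line in the dump starts with an address comment such as /*0008*/, optionally followed by a predicate like @P0 and then the opcode. The function name count_opcodes is made up, and the pattern may need adjusting for other cuobjdump versions.

```python
import re
from collections import Counter

# Assumed line shape (not guaranteed for every cuobjdump version):
#   /*0008*/   MOV R1, c[0x0][0x44];   /* 0x... */
# An optional predicate such as @P0 or @!P1 may sit between the address
# comment and the opcode. The trailing encoding comment starts with
# "/* 0x", so it does not match the address part of the pattern.
OPCODE_RE = re.compile(
    r"/\*[0-9a-fA-F]{4,}\*/\s+(?:@!?P[0-9T]\s+)?([A-Z][A-Z0-9.]*)"
)


def count_opcodes(sass_text):
    """Count how often each opcode (including suffixes like .U32) appears."""
    counts = Counter()
    for line in sass_text.splitlines():
        match = OPCODE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

To use it, save the output of cuobjdump -sass to a file, read the file into a string, pass it to count_opcodes, and print the result sorted by opcode. Unlike the grep pattern, this one accepts any digit in an opcode, not just 3 and 2.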