Is there an intrinsic function available that performs a bitwise rotate (roll) of a uint32?
We currently do this with the C macro below, which compiles to 3 CUDA assembly instructions. Having this available as a single assembly instruction, accessible via an intrinsic function, would be quite helpful.
#define SHL(x, s) ((u32) ((x) << ((s) & 31)))
#define SHR(x, s) ((u32) ((x) >> (32 - ((s) & 31))))
#define ROTL(x, s) ((u32) (SHL((x), (s)) | SHR((x), (s))))

A = ROTL(A, 3);
CUDA assembly output:
shr.u32 $r80, $r79, 29;
shl.u32 $r81, $r79, 3;
or.u32 $r82, $r80, $r81;
Rotating by a variable number of bits produces the following assembly output:
A = ROTL(A, X);

and.u32 $r136, $r134, 31;
mov.s32 $r137, 32;
sub.s32 $r138, $r137, $r136;
shr.u32 $r139, $r135, $r138;
shl.u32 $r140, $r135, $r136;
or.u32 $r7, $r139, $r140;
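One thing the macro leaves open: when s is a multiple of 32, SHR shifts by 32, which the C standard leaves undefined for a 32-bit value. Below is a minimal host-side sketch of the same rotate as an inline function that also masks the right-shift count. The names rotl32 and rotl32_selfcheck are hypothetical and not taken from r72cuda1.cu, and u32 is assumed to be a typedef for uint32_t. This only illustrates the intended semantics; it is not a claim about what the CUDA compiler emits for it.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;  /* assumption: u32 is a 32-bit unsigned type */

/* Rotate x left by s bits, with s taken modulo 32.
   Masking the right-shift count with 31 keeps it in 0..31, so s == 0
   never turns into a shift by 32. */
static inline u32 rotl32(u32 x, u32 s)
{
    s &= 31;
    return (x << s) | (x >> ((32 - s) & 31));
}

/* Hypothetical self-check, not part of the kernel. */
static void rotl32_selfcheck(void)
{
    assert(rotl32(0x80000001u, 1) == 0x00000003u);  /* top bit wraps to bit 0 */
    assert(rotl32(0x12345678u, 0) == 0x12345678u);  /* rotate by 0 is a no-op */
    assert(rotl32(0x12345678u, 32) == 0x12345678u); /* count is taken mod 32 */
    assert(rotl32(0x00000001u, 35) == 0x00000008u); /* 35 mod 32 == 3 */
}
```

For a constant count such as 3, this computes the same value as ROTL(A, 3) above.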
A large portion of the dnetc rc5-72 kernel (r72cuda1.cu) consists of these ROL operations. While the CUDA kernel performance is nothing short of impressive, a ROL intrinsic would reduce the total core instruction count from 1421 to 822. That would give us a theoretical performance improvement of about 1.7x over the current key rate, assuming the key rate scales inversely with the instruction count.
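For reference, the 1.7x figure is just the ratio of the two instruction counts. A small sketch of that arithmetic, assuming run time is proportional to the number of core instructions (memory access and scheduling effects are ignored); the function names are hypothetical:

```c
#include <assert.h>

/* Theoretical speedup if run time scales with instruction count.
   This is an assumption; real throughput depends on more than that. */
static double theoretical_speedup(unsigned before, unsigned after)
{
    return (double)before / (double)after;
}

/* Hypothetical self-check using the counts quoted above. */
static void speedup_selfcheck(void)
{
    double s = theoretical_speedup(1421u, 822u);  /* about 1.73 */
    assert(s > 1.72 && s < 1.73);
    assert(1421u - 822u == 599u);                 /* instructions removed */
}
```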
Feel free to offer any other suggestions on how to improve the performance of the core.