ROL intrinsic function

Is there an intrinsic function available that could perform a bitwise rotate of a uint32?

We are currently accomplishing this via the C macros below, which compile to 3 CUDA assembly instructions. Having this available as a single assembly instruction, accessible via an intrinsic function, would be quite helpful.

Macros:

/* Shift left by s (s taken mod 32) */
#define SHL(x, s) ((u32) ((x) << ((s) & 31)))
/* Shift right by the complementary count; note this shifts by 32
   (undefined behavior in C) when s is a multiple of 32 */
#define SHR(x, s) ((u32) ((x) >> (32 - ((s) & 31))))
/* Rotate left by s */
#define ROTL(x, s) ((u32) (SHL((x), (s)) | SHR((x), (s))))

A = ROTL(A, 3);

CUDA Assembly (PTX) Output:

        shr.u32         $r80, $r79, 29;
        shl.u32         $r81, $r79, 3;
        or.u32          $r82, $r80, $r81;

Rotating by a variable number of bits produces the following assembly output:

A = ROTL(A, X);

        and.u32         $r136, $r134, 31;
        mov.s32         $r137, 32;
        sub.s32         $r138, $r137, $r136;
        shr.u32         $r139, $r135, $r138;
        shl.u32         $r140, $r135, $r136;
        or.u32          $r7,   $r139, $r140;

A large portion of the dnetc RC5-72 kernel (r72cuda1.cu) consists of these ROL operations. While the CUDA kernel's performance is nothing short of impressive, a ROL intrinsic would reduce the total core instruction count from 1421 instructions to 822, for a theoretical performance improvement of about 1.7x (1421 / 822 ≈ 1.73) over the current key rate.

Feel free to offer any other suggestions on how to improve the performance of the core.

Thanks,

Paul

Bump… Hoping to get a reply out of someone, hopefully an NVIDIA person.

Paul,

There is no native instruction on G80 for ROL.

Most CPU architectures require two shifts and an OR to do a ROL/ROR, and so does CUDA.
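If it's useful, the rotate can at least be wrapped in a small device function so the pattern only appears once. This is just a sketch (rotl32 is a made-up helper name, not an intrinsic, and u32 is the same typedef your macros use):

/* Sketch only: not a CUDA intrinsic. Masking the right-shift count
   keeps the s == 0 case well-defined (a bare x >> 32 is undefined
   behavior in C, which your SHR macro hits when s % 32 == 0). */
__device__ u32 rotl32(u32 x, u32 s)
{
    s &= 31;
    return (x << s) | (x >> ((32 - s) & 31));
}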

Thanks for posting the link to your code. How does 144M keys/second compare to the performance on the CPU?

I’m not a crypto expert, but looking at the code, you don’t seem to be making much use of shared memory.

Would it be possible to store the "cuda_key_init" array in shared memory rather than constant memory? Constant memory performance can be sub-optimal if each thread is accessing a different address.
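Something along these lines, as a rough sketch (I'm assuming the table holds 26 u32 entries to match the RC5-72 S[] size; adjust to your real declaration):

/* Rough sketch: stage the constant table into shared memory once per
   block, then read the shared copy from then on. */
__shared__ u32 s_key_init[26];
if (threadIdx.x < 26)
    s_key_init[threadIdx.x] = cuda_key_init[threadIdx.x];
__syncthreads();
/* ... subsequent code reads s_key_init[] instead of cuda_key_init[] ... */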

On my E4300 (1.8 GHz Core 2 Duo), the fastest core achieves ~6.5 Mkeys/sec per CPU core, for a total of ~13 Mkeys/sec across both cores. The PPC core uses AltiVec and gets around 11 Mkeys/sec on my 1.25 GHz Mac mini.

To be honest, I'm not a crypto expert either. When writing the core, I tried to approach the problem so that there would be as few global, local, and shared memory accesses and as little inter-thread communication as possible. More computation, less communication :)

I was wondering: the cubin file states that the core uses 68 bytes of shared memory. I believe that 52 bytes of that is the kernel arguments and the rest is blockIdx.x, blockDim.x, and threadIdx.x. Is that correct?

Mercurial archive with the latest code:

http://dungeon.darktech.org/hg/dnetc_cuda/

In the latest version of the code, I no longer use constant memory for the initializers; the initialization macros now work as expected in the core. I originally needed constant memory because S was being placed in local memory, so initializing it via macro would incur a bunch of writes out to local memory and then reads back. Now S is replaced with individual S0…S25 variables, and the compiler places them all in registers.
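For illustration, the unrolled initialization now looks roughly like this (a sketch, not the exact macros from r72cuda1.cu; P and Q are the usual RC5 magic constants):

/* Sketch of the register-resident key schedule. Because each value is
   a distinct scalar variable, the compiler can keep all 26 of them in
   registers instead of spilling an indexed array to local memory. */
u32 S0 = P;
u32 S1 = S0 + Q;
u32 S2 = S1 + Q;
/* ... S3 through S24 follow the same pattern ... */
u32 S25 = S24 + Q;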

If you have access to your internal NVIDIA development tools, I would appreciate it if you could profile the core and let me know if there is anything that would improve the performance. The code in the Mercurial archive should build properly on Linux, but it doesn't yet have the build changes necessary to support Windows. Run "./dnetc -bench RC5-72 10". An initial timing benchmark is performed first to determine the optimum value to pass into the core for "iterations". Once you see percent-complete figures displayed, the core is in its normal running mode, and the value passed in for "iterations" should remain relatively constant for the duration of the benchmark.

Thanks,

Paul

In your last post you said:

[quote]
I tried to look at the problem so that there would be the least amount of global, local and shared memory accesses…
[/quote]

Why are you avoiding shared memory? Access would be as fast as registers.

So you could have S0…S25 in shared memory (using S[0] through S[25] instead) and compute each of them in a different thread simultaneously:

/* Initialize the S[] with constants */
#define KEY_INIT(_i) S[_i] = (P + (_i)*Q)

if (tx < 26)
    KEY_INIT(tx);
__syncthreads();

/* Drop out early if we don't have any data to process */
if (((bx * bd) + tx) > process_amount) {
    /* Warning... Make sure you DON'T use __syncthreads() */
    /* anywhere after this point in the core!!!           */
    return;
}

It wouldn't parallelize the bulk of the work, but the savings may still be non-trivial.

[quote name='Stewie' date='Apr 9 2007, 11:53 AM']
Why are you avoiding shared memory? Access would be as fast as registers.
[/quote]

Although S0…S25 get initialized to the same starting values, they get clobbered differently by each instantiated thread (via the ROTL_BLOCK_* macros), so they either need to be separate per-thread variables or a large shared array like S[blockDim.x][26].

I chose not to use the shared array because of the possibility of bank conflicts, but I'll give it a try. I'll make the shared array S[blockDim.x][32] to hopefully avoid any bank conflicts.

Direct link to the latest core code:

r72cuda1.cu

Whoops, I believe that should be: S[32][blockDim.x]… sorry…
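In sketch form, the corrected layout is (num_threads is the compile-time block size; S_() is just illustrative shorthand):

/* Transposed layout: thread tx owns column tx, so when every thread
   touches element i at the same time, consecutive threads hit
   consecutive words and therefore different banks. */
#define num_threads 64                    /* block size I tested with */
__shared__ u32 S[32][num_threads];
#define S_(i) S[(i)][threadIdx.x]         /* this thread's element i */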

Using __shared__ u32 S[32][num_threads], the best performance I could achieve was 79 Mkeys/sec. I had to drop num_threads to 64 to get the core to fit into the 16 KB of shared memory. With the shared S, the cubin file says 30 registers are being used, versus 32 in the unmodified version of the core, so unfortunately this doesn't really reduce the register count either.

Looking at the .ptx file, because shared memory is being used, all of the ROTL blocks have an additional ld.shared.u32 and st.shared.u32. I believe this is partially responsible for the performance degradation.
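Roughly speaking (illustrative source, not actual compiler output):

/* Register operand: ROTL compiles to just shl + shr + or. */
A = ROTL(A, 3);

/* Shared-memory operand: the same ROTL also emits a ld.shared.u32 to
   fetch the value and a st.shared.u32 to write it back. */
S_(0) = ROTL(S_(0), 3);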

I was wondering if one of the NVIDIA employees could profile the RC5-72 core code. I have built a testbench project (linked below) that consists only of a main() and the core itself.

Run "./build.sh" to build the project.
Run "time ./test" and use the resulting "real" time to calculate the key rate with the following formula:

150 * (30 / {REAL_TIME}) = MKeys/sec

(For example, a real time of 30 seconds works out to 150 MKeys/sec.)

For reference, I see ~144-150 MKeys/sec on an 8800 GTX under Linux.

http://dungeon.darktech.org/dnetc/r72cuda1…20070427.tar.gz

Thanks,

Paul

Any updates? Do you have a compiled binary I can use with my 8800 GTX? Is the work the client completes officially accepted for RC5?
TIA
Keep up the great work!