Optimizing SHA256 algorithm

Hello, CUDA beginner here.

I am working on a student project and have a special use case for the SHA-256 algorithm: I need to calculate as many hashes as possible in a very short time. The good thing is that only 8 bytes of the entire message change. It is usually said that a tiny change to the message results in a completely different output hash (the avalanche effect). That is of course true, but I can still exploit the way the algorithm calculates the hash. Here is a short summary of how SHA-256 works:

  1. Initialization: The padded message is divided into blocks of 512 bits, and a constant initialization vector (IV) provides the starting hash state.
  2. Pre-processing: Each block undergoes pre-processing, generating a bit sequence called the “message schedule”. The block is expanded into 64 32-bit words, usually named “W”.
  3. Main processing: The message schedules are fed one by one through the compression function. For each block a 256-bit hash state is produced, which serves as the input state for the next block.
  4. Finalization: The last block additionally contains the padding and the number of processed bits in the data, and goes through the compression function like any other block.
  5. Output: The final hash state is the digest of the original data.

In code it is usually implemented with four functions.

  1. sha256_init() → initializes the required structure and loads the initial hash state
  2. sha256_update() → copies the data into the working buffer; once 64 bytes (512 bits) are reached it calls 3.
  3. sha256_transform() → the actual compression algorithm; transforms the data block and updates the hash state
  4. sha256_final() → finalizes the hash by adding the padding and the message bit counter

In my use case I can get rid of most of that functionality. I only have one chunk (64 bytes / 512 bits) with a sequence counter as an unsigned long (8 bytes) at the beginning (positions 0-7). The remaining message already contains the padding and the message length. However, my initial hash state is a different one, and I need to use it to calculate the result. The goal is to find a hash close to a defined target hash, so I need to compare the current hash with the target and the best hash so far (target < current < best).

I call this new function sha256_transform_opt. It takes the initial state as an array, the current sequence counter called nonce, the remaining 56 bytes as a WORD array, and finally an output state array.

The original implementation I used is this one.

Here is the “core code”; I removed all the imports and only show the important part (IMO).

__device__ void sha256_transform_opt(const WORD init_state[], const unsigned long nonce, const WORD data[], WORD out_state[])
{
   WORD a, b, c, d, e, f, g, h, i, t1, t2, m[64];

   // add the current nonce at the beginning of the m array;
   // I need to shift the nonce bytes into word order
   unsigned char *tmp = (unsigned char *)&nonce;
   m[0] = (tmp[0] << 24) | (tmp[1] << 16) | (tmp[2] << 8) | (tmp[3]);
   m[1] = (tmp[4] << 24) | (tmp[5] << 16) | (tmp[6] << 8) | (tmp[7]);

   // copy over the remaining static data
   m[2] = data[0];
   m[3] = data[1];
   m[4] = data[2];
   m[5] = data[3];
   m[6] = data[4];
   m[7] = data[5];
   m[8] = data[6];
   m[9] = data[7];
   m[10] = data[8];
   m[11] = data[9];
   m[12] = data[10];
   m[13] = data[11];
   m[14] = data[12];
   m[15] = data[13];

   // store init state in tmp variables
   a = init_state[0];
   b = init_state[1];
   c = init_state[2];
   d = init_state[3];
   e = init_state[4];
   f = init_state[5];
   g = init_state[6];
   h = init_state[7];

   // directly start with calculating the next state for the first 16 words
#pragma unroll 16
   for (i = 0; i < 16; ++i) {
       t1 = h + EP1(e) + CH(e, f, g) + dev_k[i] + m[i];
       t2 = EP0(a) + MAJ(a, b, c);
       h = g;
       g = f;
       f = e;
       e = d + t1;
       d = c;
       c = b;
       b = a;
       a = t1 + t2;
   }

   // continue with the remaining 48 words, but first extend the m array at this position
#pragma unroll 64
   for (i = 16; i < 64; ++i) {
       m[i] = SIG1(m[i - 2]) + m[i - 7] + SIG0(m[i - 15]) + m[i - 16];
       t1 = h + EP1(e) + CH(e, f, g) + dev_k[i] + m[i];
       t2 = EP0(a) + MAJ(a, b, c);
       h = g;
       g = f;
       f = e;
       e = d + t1;
       d = c;
       c = b;
       b = a;
       a = t1 + t2;
   }

   // do not overwrite init_state, it is used by other threads too
   out_state[0] = init_state[0] + a;
   out_state[1] = init_state[1] + b;
   out_state[2] = init_state[2] + c;
   out_state[3] = init_state[3] + d;
   out_state[4] = init_state[4] + e;
   out_state[5] = init_state[5] + f;
   out_state[6] = init_state[6] + g;
   out_state[7] = init_state[7] + h;
}


__device__ char is_b_bigger_than_a(unsigned char *a, unsigned char *b)
{
   // Hashes are compared as big endian (requirement)
   for (char i = 28; i >= 0; i = i - 4) {
       for (char k = 0; k < 4; ++k) {
           unsigned char a_byte = a[i + k];
           unsigned char b_byte = b[i + k];
           //printf("%02x < %02x\n", a_byte, b_byte);
           if (a_byte < b_byte) {
               return 1;
           } else if (a_byte > b_byte) {
               return 0;
           }
           // equal bytes: keep comparing the less significant ones
       }
   }
   return 0;
}

__device__ char is_current_better(WORD t[], WORD c[], WORD b[])
{
   // XOR the most significant WORD (4 bytes) first; only if it matches the
   // target exactly, start the byte-by-byte check
   WORD tmp = t[7] ^ c[7];
   if (tmp != 0) {
       return 0;
   }

   if (is_b_bigger_than_a((unsigned char *)t, (unsigned char *)c) && is_b_bigger_than_a((unsigned char *)c, (unsigned char *)b)) {
       return 1;
   }

   return 0;
}

extern "C"
__global__ void sha256_kernel_opt(WORD best_found_hash[], const WORD m_init[], const WORD init_state[], uint64_t *nonce_offset, WORD target[], uint64_t *best_nonce)
{
   uint64_t idx = blockIdx.x * blockDim.x + threadIdx.x;
   uint64_t nonce = idx + *nonce_offset;
   WORD out_state[8];

   sha256_transform_opt(init_state, nonce, m_init, out_state);

   if (is_current_better(target, out_state, best_found_hash)) {
       // note: two winning threads in the same launch can race on this update
       memcpy(best_nonce, &nonce, 8);
#pragma unroll 8
       for (int i = 0; i < 8; ++i) {
           best_found_hash[i] = out_state[i];
       }
   }
}
Example values: I calculated the first 300_000_000_000 possibilities @ 3,230,124,702 H/s. The correctness of the entire hash is checked and already verified by passing the full message with nonce to the crypto library in C.

target_hash: 0xde6f564523e9460388456e3bdf3e8171a6d8dc47fb63f6c0ee261f83f4a0d7aa
init_state= [
nonce_offset = 0

data = 0x8000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000440

-> best_nonce = 15_120_750_820
-> best_hash = 0xde6f564531a568a01c80dfbdfab6e6b5589a8e90ea4c03c4cb722d34cfab68bd
difference to target : 0.000000001438 %

I made some tests with different GPUs; the more CUDA cores, the better the result, and therefore the 4090 was obviously the best one. If you consider the price, the 2070 Super gives the best price/performance ratio.

GeForce GTX 1060 6GB     -> 1,159bn H/s
GeForce RTX 2070 Super   -> 3,700bn H/s
GeForce RTX 3060 Ti 8GB  -> 3,200bn H/s
GeForce GTX 1080 Ti 11GB -> 2,400bn H/s
GeForce RTX 4090         -> 17,000bn H/s
A100                     -> 7,000bn H/s
A4000                    -> 3,800bn H/s
A5000                    -> 5,400bn H/s
A6000                    -> 7,300bn H/s

So my question is: am I missing something obvious to optimize? Is there anything else to optimize?

I use PyCUDA to call my kernel function, in case there is something to optimize there too.

One optimization I tried, but which decreased the performance drastically, was a loop inside the kernel to directly calculate multiple nonces in sequence. My thought was to reduce the overhead of the many kernel launches needed when each thread handles only a single nonce.


PyCUDA is a good Python extension for using CUDA functions, but I fear it lacks the speed of the C language itself.

I’m interested in reproducing the results myself. Could you please share the PyCUDA code somewhere like a git repository so I could regenerate the result and comment more?

Also, I’m a bit confused: what do you mean by “bn H/s”?

bn H/s means billion (1e9) hashes per second. I also assume that the OP used a decimal comma, not a decimal point.

The pycuda overhead should be negligible here, as most of the time is spent in the compute kernel.

This pragma unroll in OP’s code should probably be an unroll 48, as the loop starts at count 16.

#pragma unroll 64

Also I am not sure if I like the

WORD m[64]

as a local variable. I am unsure whether such a large array can be promoted to registers when the loops are fully unrolled.

Has OP ever looked at the PTX or SASS code for this kernel? Also did you ever run kernel profiling tools?

Looking at the register count for the compiled kernel could give some clues. Run the occupancy calculator on it to see if trying to lower the register count might result in higher occupancy on the GPU.


Yes, you are right, it is billion H/s.

I already fixed the pragma unroll; it seems to have had no impact at all.

For the PTX or SASS, I have not checked this yet, since this is my first time ever writing a CUDA kernel. I did run a profiling tool (NVIDIA Nsight); the overhead of PyCUDA is negligible. But other than that, I could not figure out anything else useful.