Hello, CUDA beginner here.

I am working on a student project and I have a special use case for the SHA256 algorithm. I need to calculate as much hashes as possible for a really short duration. The good thing is, that only 8 bytes of the entire message is changing. Usually its said that a tiny change of the message results in a acceptable change in the output hash. Which is of course true, but I can use the way the algorithm calculates the hash. Here a short summary how the SHA256 works:

- Initialization: The original data is divided into blocks of 512 bits and a constant initialization vector (IV) is used.
- Pre-processing: Each block undergoes pre-processing, generating a bit sequence called the â€śmessage scheduleâ€ť. This includes expanding the block to 64 32-bit words and initializing constants â€śWâ€ť.
- Main processing: The message schedules are fed one-by-one through the main processing algorithm. Here, a hash value consisting of 256 bits is generated for each block. This hash value is used as a new message schedule for the next block.
- Finalization: After all blocks have gone through the main processing algorithm, a final hash value is generated. This uses the last message schedule and the number of processed bits in the data.
- Output: The final hash value is used as the digital signature for the original data.

As code it usually is implemented with four functions.

- sha256_init() â†’ initializes the required structure with variables and loads the initial hash state
- sha256_update() â†’ copies the data into the operating fields, once 64 bytes(512 bit) is reached it calls 3.
- sha256_transform() â†’ the actual algorithm, transforms the data and applies the hashing and stores the state
- sha256_final() â†’ finalizes the hash by adding the padding and the message counter

In my Use-Case I can get rid of most of the functionality. I only have one chunk(64byte - 512bit) with a sequence counter as unsigned long (8byte) in the beginning(position 0 - 7). The remaining message already contains the padding and the message length. However my initial hash state is a different one and I need to use this one to calculate the result. The goal is to find a hash close to a defined target hash. So I need to compare the current hash with the target and the best hash sofar. ( target < current < best )

I call this new function sha256_transform_opt. It takes the initial state as an array, the current seqeuence counter called nonce, the remaining 56 bytes as a WORD array and finally a output state array.

The original Implementation I used is this one.

Here is the â€ścore codeâ€ť, I removed all the imports and only show the important part (IMO).

```
__device__ void sha256_transform_opt(const WORD init_state[], const unsigned long nonce, const WORD data[], WORD out_state[])
{
WORD a, b, c, d, e, f, g, h, i, t1, t2, m[64];
// add the current nonce at the beginning of the m array
// I need to shift the nonce
unsigned char *tmp = (unsigned char *)&nonce;
m[0] = (tmp[0] << 24) | (tmp[1] << 16) | (tmp[2] << 8) | (tmp[3]);
m[1] = (tmp[4] << 24) | (tmp[5] << 16) | (tmp[6] << 8) | (tmp[7]);
// copy over the remaining static data
m[2] = data[0];
m[3] = data[1];
m[4] = data[2];
m[5] = data[3];
m[6] = data[4];
m[7] = data[5];
m[8] = data[6];
m[9] = data[7];
m[10] = data[8];
m[11] = data[9];
m[12] = data[10];
m[13] = data[11];
m[14] = data[12];
m[15] = data[13];
// store init state in tmp variables
a = init_state[0];
b = init_state[1];
c = init_state[2];
d = init_state[3];
e = init_state[4];
f = init_state[5];
g = init_state[6];
h = init_state[7];
// directly start with calculating the next state for the first 16 words
#pragma unroll 16
for (i=0; i < 16; ++i){
t1 = h + EP1(e) + CH(e, f, g) + dev_k[i] + m[i];
t2 = EP0(a) + MAJ(a, b, c);
h = g;
g = f;
f = e;
e = d + t1;
d = c;
c = b;
b = a;
a = t1 + t2;
}
// continue with the remaining 48 words but first update the m array at this position
#pragma unroll 64
for (i=16; i < 64; ++i){
m[i] = SIG1(m[i - 2]) + m[i - 7] + SIG0(m[i - 15]) + m[i - 16];
t1 = h + EP1(e) + CH(e, f, g) + dev_k[i] + m[i];
t2 = EP0(a) + MAJ(a, b, c);
h = g;
g = f;
f = e;
e = d + t1;
d = c;
c = b;
b = a;
a = t1 + t2;
}
// do not override init_state its used by other threads too
out_state[0] = init_state[0] + a;
out_state[1] = init_state[1] + b;
out_state[2] = init_state[2] + c;
out_state[3] = init_state[3] + d;
out_state[4] = init_state[4] + e;
out_state[5] = init_state[5] + f;
out_state[6] = init_state[6] + g;
out_state[7] = init_state[7] + h;
}
__device__ char is_b_bigger_than_a(unsigned char *a, unsigned char *b)
{
// Hashes are compared as big endian (requirement)
for(char i = 28; i >= 0; i=i-4){
for(char k = 0; k < 4; ++k){
unsigned char a_byte = a[i + k];
unsigned char b_byte = b[i + k];
//printf("%02x < %02x\n", a_byte, b_byte);
if(a_byte < b_byte){
return 1;
}else if(a_byte == b_byte){
continue;
}
else{
return 0;
}
}
}
return 0;
}
__device__ char is_current_better(WORD t[], WORD c[], WORD b[])
{
// use XOR for the first WORD(4 bytes) only if they are zero start a byte by byte check
WORD tmp = t[7] ^ c[7];
if(tmp != 0){
return 0;
}
if(is_b_bigger_than_a((unsigned char *) t, (unsigned char *) c) && is_b_bigger_than_a((unsigned char *)c, (unsigned char *) b)){
return 1;
}
return 0;
}
extern "C"
__global__ void sha256_kernel_opt(WORD best_found_hash[], const WORD m_init[], const WORD init_state[], uint64_t * nonce_offset, WORD target[], uint64_t * best_nonce) {
uint64_t idx = blockIdx.x * blockDim.x + threadIdx.x;
uint64_t nonce = idx + *nonce_offset;
WORD out_state[8];
sha256_transform_opt(init_state, nonce, m_init, out_state);
if(is_current_better(target, out_state, best_found_hash)){
memcpy(best_nonce, &nonce, 8);
#pragma unroll 8
for(int i = 0; i < 8; ++i){
best_found_hash[i] = out_state[i];
}
}
}
```

Example values, I calculated the first 300_000_000_000 possibilities @ 3,230,124,702 H/s. The correctness of the entire hash is checked and already verified by passing the full message with nonce to the cryptolib in C.

```
target_hash: 0xde6f564523e9460388456e3bdf3e8171a6d8dc47fb63f6c0ee261f83f4a0d7aa
init_state= [
0x2be3b4a4
0x8d7b1350
0xb720e55e
0xc552d0ed
0xd67fab7a
0x7b2c2d06
0x0cf92f52
0x898ea3a5
]
nonce_offset = 0
data = 0x8000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000440
-> best_nonce = 15_120_750_820
-> best_hash = 0xde6f564531a568a01c80dfbdfab6e6b5589a8e90ea4c03c4cb722d34cfab68bd
difference to target : 0.000000001438 %
```

I made some tests with different GPUs, the more CUDA cores the better the result and therefore the 4090 was obviously the best one. If you considere the price the 2070 super results in the best price / performance ratio.

```
GeForce GTX 1060 6GB -> 1,159bn H/s
GeForce GTX 2070 Super -> 3,700bn H/s
GeForce RTX 3060 Ti 8GB -> 3,200bn H/s
GeFroce GTX 1080Ti 11GB -> 2,400bn H/s
GeForce RTX 4090 -> 17,00bn H/s
A100 -> 7,000bn H/s
A4000 -> 3,800bn H/s
A5000 -> 5,400bn H/s
A6000 -> 7,300bn H/s
```

So my question is, do I miss something obvious to optimize? Is there something else to optimize?

I use pycuda to call my kernal function if there is something to optimize too.

One optimization I tried, but decreased the performance drastically, was a loop inside the kernel to directly calculate multiple nonces in sequence. My thought was to reduce the overhead of calling the kernel with a single nonce each time.