Any simple SHA256 and RIPEMD160 CUDA implementations I can try?

Hello,

I am very new to CUDA programming, although I have some experience with parallel programming, finding 56-bit DES crypto keys and simple things like that.

I have some good books and a reasonable understanding of host and device memory, plus the usual beginner material: vector addition, cudaMalloc, cudaFree, cudaMemcpy with cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost, __syncthreads(), cudaDeviceSynchronize(), but just the bare basics.

I’m trying to wrap my head around the kernel<<<X, Y>>>() launch syntax now, and figuring out how to get the compute capability of a GTX 1060 and use safe values to start with. Later I want to dynamically query all cards and compute on the fly what I can do with kernel<<<X, Y>>>(), e.g.

int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
add<<<numBlocks, blockSize>>>(N, x, y);

But first I will keep it simple with kernelfunc<<<1, 256>>> or something like that, or just <<<1, 1>>> in the early stages of testing, in case I run into race conditions or threads reading/writing the same memory locations.
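Something like this is what I have in mind for the query step later on (an untested sketch; the comments are just my own guesses at safe values):

int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);

for (int dev = 0; dev < deviceCount; dev++)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);

    // e.g. a GTX 1060 reports compute capability 6.1
    printf("Device %d: %s (compute %d.%d)\n",
           dev, prop.name, prop.major, prop.minor);
    printf("  max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("  multiprocessors:       %d\n", prop.multiProcessorCount);

    // a safe starting block size is anything <= prop.maxThreadsPerBlock;
    // 256 seems to be a common default
}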

Back to the main topic: to practice with, I like to create cryptocurrency brainwallets. I find my books cover the most boring and obscure topics, so I am not learning anything and keep getting stuck. They usually center on graphics and mathematical topics, which I don’t understand or find interesting. It’s not their fault, of course; I just need to find topics of interest to me so I can move on.

The first task for me is to find a sha256 kernel and a ripemd160 kernel (or maybe translate some C code and create them, though I don’t know where to find the specs of these hash functions) that I can test with, and verify the hash outputs are correct.

I had in mind something like this, as easy to use as the OpenSSL functions (or as close as possible!):

//
// Maybe later a vector<std::string> passphrases
//
//  unsigned char tmp_hash[32];
//
//  Pass in size of vector and try to unroll that loop
//  for (size_t i = 0; i < vec_size; i++)
//
//     Do the SHA256 on each
//     SHA256(vec[i], tmp_hash)
//     memcpy(tmp_hash, out_hash_vec[i], 32);
//
//     Stash them back in an array of unsigned char* with 32 bytes for the hash
//

__global__ void brainwallet_keys(const unsigned char *in_passphrase, const size_t length, unsigned char *out_hash)
{
    // NB: OpenSSL's SHA256_* functions are host-only, so a __device__
    // implementation with this same shape would be needed; the context
    // is a plain struct, not a pointer
    SHA256_CTX ctx;
    SHA256_Init(&ctx);
    SHA256_Update(&ctx, in_passphrase, length);
    SHA256_Final(out_hash, &ctx);
}

int main()
{
   std::cout << "Enter brainwallet passphrase> ";
   std::string s;
   std::getline(std::cin, s);

   const size_t pass_size = s.length();

   // password for host to give to device
   // host hash device gives to us

   // a VLA isn't standard C++, so use a zero-initialized vector instead
   std::vector<unsigned char> host_password(pass_size + 1, 0);
   unsigned char host_hash[SHA256_DIGEST_LENGTH];

   // password for device, device memory for hash output
   unsigned char *dev_password, *dev_hash;

   // vector is already zeroed; copy in the passphrase bytes
   memcpy(host_password.data(), s.c_str(), pass_size);

   // alloc memory for device password, copy it from host to device memory
   cudaMalloc( (void**)&dev_password, pass_size );
   cudaMemcpy( dev_password, host_password.data(), pass_size, cudaMemcpyHostToDevice );

   // alloc memory for the output hash
   // no copy across because it's output only?
   cudaMalloc( (void**)&dev_hash, SHA256_DIGEST_LENGTH );

   brainwallet_keys<<<1, 1>>>(dev_password, pass_size, dev_hash);

   // do I need this here?
   cudaDeviceSynchronize();

   cudaMemcpy(host_hash, dev_hash, SHA256_DIGEST_LENGTH, cudaMemcpyDeviceToHost);
   
   // Deallocate memory, host hash safely stashed away
   cudaFree(dev_password);
   cudaFree(dev_hash);

   cudaDeviceReset();

   // Print out hash in hex from host_hash array
   ...

   return 0;
}
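One more thing I plan to add, based on examples I’ve seen around: checking the status every CUDA call returns, with something like this (my own sketch):

#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// usage:
// CUDA_CHECK(cudaMalloc( (void**)&dev_hash, SHA256_DIGEST_LENGTH ));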

Sorry this is so ugly and hackish; I’m trying to illustrate what I want to achieve before I get on with the more advanced topics and finish the brainwallets: printing the uncompressed and compressed pubkey variants, their Bitcoin addresses (the ones starting with 1), and WIF keys. I will get to that later, but first the hash routines, and whether I have basically got the allocate, copy back and forth, sync, free, and device-reset sequence right.

Thank you for any help!

“Simple” is in the eye of the beholder. A Google search should quickly reveal multiple implementations; pick the one you like best. I found these on the first page of my Google search results:

Forum participant allanmac has a SHA256 implementation here:

[url]https://gist.github.com/allanmac/8745837[/url]

Forum participant cbuchner has a SHA256 implementation here:

[url]https://github.com/cbuchner1/CudaMiner/blob/master/sha256.cu[/url]

Thank you!

I will go and have a look. I haven’t seen these ones yet.

Neither is suitable. The first one was promising, but it is old code for 32-bit machines, with 32-bit assembler. I am using CUDA 10 with GTX 1060 cards on 64-bit Ubuntu.

I am not sure I could properly translate the 32-bit code into 64-bit code…

The second one is for a miner, not straight sha256 or ripemd160; that’s the problem I’m having. I just need to hash: launch a kernel that will perhaps eventually launch a bunch of hash-routine kernels all running at once, if I can. That’s the goal, and then verify the hash output is correct for each string input. So the hash function has to accept char arrays.

I will keep looking, thank you…

You may not have a good enough grasp of CUDA GPUs to make this statement correctly.

I’m not an expert in this space, and maybe allanmac will come along to clarify, but I wouldn’t be surprised if the choice to use -m32 there was a carefully chosen option to optimize performance: basically forcing the machine to use 32-bit pointers to limit register pressure and index-calculation intensity. It looks like exquisite code to me, by which I mean carefully crafted by someone well-practiced in the art of CUDA coding (I can’t speak to the SHA256 side of it). Furthermore, 32-bit machine mode in CUDA is still capable of handling 64-bit data types, such as long long.

CUDA GPUs are largely 32-bit machines anyway. The only 64-bit operation they have native support for that I can think of offhand is 64-bit floating-point multiply-add, which is not relevant here. All integer operations of any type (including 64-bit) are performed by concatenating 32-bit operations together, in any machine model.
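As a quick illustration (my own sketch, unrelated to the SHA256 code), 64-bit integer arithmetic in a kernel works fine in either machine model; under the hood the machine carries it out with 32-bit adds plus carry propagation:

__global__ void add64(const unsigned long long *a,
                      const unsigned long long *b,
                      unsigned long long *c)
{
    // a 64-bit add; the hardware executes this as a pair of
    // 32-bit additions with carry
    *c = *a + *b;
}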

If you found that code to be interesting, maybe you will come back later to it and find it is useful after all.

Hi, thank you Robert.

I do believe that code was expertly crafted, although nvcc (invoked directly, not through clang or another compiler) won’t allow it: I got a message to the effect that -m32 is no longer supported for 32-bit machine development. It must target 32-bit environments; I have to look more into the CUDA toolkit and nvcc options and what they all really mean.

It’s very interesting these are largely 32-bit machines. I didn’t know that!

I will come back to the code, especially if I can figure out why nvcc is telling me 32-bit development isn’t supported anymore when I try to compile. Perhaps this is a host issue.

I would think, given the pseudo-code implementations described here:

that k[0…63] has to be marked __global__ (device and host?) or __constant__?

Usually in pure C code, when I’ve looked around, it’s

static const uint32_t k[] = { ... }

I’ve also seen a declaration of an empty dev_k, but so far no clues on how it gets into device memory.
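My best guess so far, pieced together from examples (untested, and dev_k is just the name those examples used), is something like:

// the 64 SHA-256 round constants, kept in device constant memory
__constant__ uint32_t dev_k[64];

// host-side table (first two constants shown)
static const uint32_t k[64] = { 0x428a2f98, 0x71374491, /* ... */ };

// copy to the __constant__ symbol with cudaMemcpyToSymbol,
// not a plain cudaMemcpy with a device pointer
cudaMemcpyToSymbol(dev_k, k, sizeof(k));

and then kernels would refer to dev_k directly by name, without it being passed as a parameter.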

When calling a kernel, does all input have to be allocated and copied into device memory? And do all helper functions have to be marked __device__? Unless the helpers are macros and get expanded in the code before compilation.

Suppose there is a kernel:

unsigned char* dev_hash;
cudaMalloc( (void**)&dev_hash, 32 );

__global__ void hash_me(const unsigned char *in_message, const size_t in_mlength, unsigned char *out_hash);

And calling it in some way:

hash_me<<<1, 1>>>("abc", 3, dev_hash);

where “abc” and 3 are initialized in host code (and memory), and dev_hash is created:

… later the hash is copied across to host memory:
… assume host_hash already exists:

cudaMemcpy(host_hash, dev_hash, 32, cudaMemcpyDeviceToHost);
cudaFree(dev_hash);

is that something which will work? I’m now confused by all the different ways I’ve seen CUDA kernels done in slides, books, and on the Internet. What I think is that the host variables have to be copied across into device memory first, all of them.
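In other words, I suspect the working version has to look more like this (untested):

const char *msg = "abc";
const size_t mlen = 3;

unsigned char *dev_msg;
cudaMalloc( (void**)&dev_msg, mlen );

// the literal lives in host memory, so copy it across first;
// the kernel can only dereference device pointers
cudaMemcpy( dev_msg, msg, mlen, cudaMemcpyHostToDevice );

hash_me<<<1, 1>>>(dev_msg, mlen, dev_hash);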

But now I am not sure. It brings me back to the k[0…63] question: whether it is properly created as a global in some way, in device memory, for the hash kernel to use, in the pure-C-to-CUDA implementations I’ve looked at so far. Some code seems to be missing, like the initialization step, what happens with the corresponding dev_k, and whether it is sufficient to declare it __constant__ and then:

cudaMemcpy(dev_k, k, size, cudaMemcpyHostToDevice);

I will probably have to implement the sha code myself and carefully translate it into a terrible, sloppy CUDA kernel just to get it running once, to see if it produces the correct output. Here I have to be careful with the preprocessing steps: padding and breaking the message into 512-bit blocks, or having a helper function that splits a character-string input into the correct number of blocks to make the sha256 code a little easier to deal with…
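For the preprocessing, my reading of the spec (FIPS 180-4) is: append a 0x80 byte, pad with zero bytes until the length is 56 mod 64, then append the original length in bits as a 64-bit big-endian integer. A host-side sketch (untested):

// pad a message out to whole 512-bit (64-byte) blocks;
// caller frees the buffer, *padded_len receives its size
unsigned char *sha256_pad(const unsigned char *msg, size_t len, size_t *padded_len)
{
    // room for the message + 0x80 + 8-byte length, rounded up to 64
    size_t total = ((len + 1 + 8 + 63) / 64) * 64;
    unsigned char *buf = (unsigned char *)calloc(total, 1);
    if (buf == NULL) return NULL;

    memcpy(buf, msg, len);
    buf[len] = 0x80;                      // a single 1 bit, then zeros

    uint64_t bitlen = (uint64_t)len * 8;  // original length in bits,
    for (int i = 0; i < 8; i++)           // big-endian in the last 8 bytes
        buf[total - 1 - i] = (unsigned char)(bitlen >> (8 * i));

    *padded_len = total;
    return buf;
}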

__global__ is never used to mark data in CUDA.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global