Cuda sha256 calculations improvements

I’m new to cuda development, I tried to do a program which illustrates the bitcoin mining difficulty.

You enter a string then it finds an associated nonce prepended to the input string matching a sha256 with some number of zeros at the beggining.

For example if I want a difficulty of 6 and I put “moffa13” as input string, the program returns

6253010moffa13 which has this associated hash :

0000002dece0c0f5791305f53bfd5116966ea97a9604984cbb50891f243e5641 (6 zeros before)

My program can currently do 2-4MH/s.
I created the same program but for cpu and I can do over 6 millions hashes in a second.

I’m pretty sure I can optimize this but I don’t know how.

What I’m doing is that I run a for loop which starts a kernel and processes hashes. One thread does one hash as it can’t be parallelised. If one thread finds the right nonce, a global variable is set to 1 then the other threads return. After the kernel call, I run cudaDeviceSynchronize and I check if the global variable is set to 1. If not, I rerun the kernel with an updated nonce offset.

Is it the right way, calling multiple times the kernel ?
Maybe should I change my code’s design ?

Please tell me what should be changed in order to optimize this :)

Here’s the github : https://github.com/moffa13/SHA256CUDA
Here’s the cpu version : https://github.com/moffa13/SHA256Speed

Thanks !

Don’t do in-kernel malloc. Allocate outside of your kernel calling loop.

You’re only running 4 blocks? That’s going to be a perf limiter on nearly all GPUs.

what GPU are you running on?

Are you talking about the sha256.cuh file ?

I have a GTX 970 GPU.

I did like you said, I can now calculate up to 45 millions of hash/s (I can hear my gpu whistling) but my program crashes after ~30sec

How many blocks should I run parallel in a kernel ? I set 1024 threads in a block but don’t know how much blocks.

Also, what’s the difference between cudaSetDevice(0) & cudaSetDevice(1)
Can you give me a quick explanation of what can be achieved with cuda streams ?

A reasonable minimum target is to launch a total number of threads of at least # of SM * 2048.

These should be split between blocks with usually something in the range of 128,256, or 512 threads per block. It might be that 1024 threads per block is “OK”, it just requires some analysis to confirm.

Your GTX970 has 13 SMs, so target 13*2048 = 26K threads, ballpark, minimum. If you put 1024 threads per block, that would be 26 blocks. I’m not saying I know how to transform your code from 4 blocks to 26, but that is a reasonable performance goal, to maximize throughput.

I’m not going to try and teach you CUDA here on this forum. Use your google search to answer basic questions. There is documentation that will define what cudaSetDevice does:

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g159587909ffa0791bbe4b40187a4c6bb

There is a programming guide that discusses streams:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#streams

and of course there are a bazillion additional writeups and resources all over the web.

can you just start with existing sha256 GPU calculation code? there are lot of mining algo sources on githuи

and anyway, you need to learn CUDA in order to get idea how GPU works and how code should be optimized. otherwise, you easily can get worse performance than on CPU

Correct me if I’m wrong.

I don’t understand why you are talking about 4 blocks because I run 10 blocks of 1024 threads.

If I run a single kernel, there is only one SM working ? So the max efficiency would be 2048 threads so it means 2 blocks with 1024 threads each.

If i use 13 streams will the hashing power be 13x faster ?

I looked at this:

#define BLOCK_SIZE 1024
#define SHA_PER_ITERATIONS 3200
#define NUMBLOCKS (SHA_PER_ITERATIONS + BLOCK_SIZE - 1) / BLOCK_SIZE

sha256_kernel << < NUMBLOCKS, BLOCK_SIZE >> > (g_out, g_hash_out, g_found, d_in, input_size, difficulty, nonce);

(3200 + 1024 - 1)/1024 = 4

if you don’t believe me, print out those quantities.

No, not correct. A kernel with enough blocks will fill all the SMs on a GPU, and this is generally desirable. Please avail yourself of an organized introduction to CUDA. I know you’d like to learn it in 5 minutes, but in my experience it takes longer than that. Piecemeal Q+A “Socratic” method might seem “efficient” but is actually quite inefficient for learning a body of knowledge like this, IMO.

A basic intro sequence to CUDA might be:

part1: http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0624-Monday-Introduction-to-CUDA-C.pdf

part2: http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0514-GTC2012-GPU-Performance-Analysis.pdf

If you want to learn what streams are for, google “gtc cuda streams” and take the first hit.

It’s amazing how prevalent this line of thinking is when it comes to streams. If that were true, I’d advise you not to stop at 13.

no, each next thread block will go to the next SM

this just shows how deep is your misunderstanding of GPU execution model. do we need to read aloud entire CUDA manual or you can read it yourself?

@txbob Oh sorry this is because I changed this value (3200) to 10240 between the posts.

I already read the whole part1 but it’s a really basic introduction ^^

Thanks for the second one I’ll read it.

the github repo is the only thing I have to look at. It still says 3200 as of this posting, at this moment. The code I excerpted was pulled from your github repo, and still reflects that.

Obviously I can’t see what code you are actually running.

Roughly speaking, the “part1” I posted teaches you how to write syntactically correct CUDA, that will give you the correct answer (arguably the first step/requirement for any programmer). The “part2” teaches the basics of how to write CUDA code that runs “fast”. Your questions here mostly revolve around the latter, not the former.

With a lack of understanding of the latter, you can easily write code that gives you the correct answer but runs slow. You may then arrive at the conclusion that GPUs are worthless. Presumably they are not.

Hello, I still have a question.

As each thread can find a correct answer to return to the host, how to only make one thread pass through the if statement which copies the data back to the host and avoid race conditions ?

Example :

bool isOk = check(someVar); // Many different someVar can work
if(isOk){
  // I want this only accessed by one thread, if possible the one with the lowest id
  memcpy(h_someVar, someVar, x);
}

Edit: Never mind, I did this

bool isOk = check(someVar); // Many different someVar can work
if(isOk && atomicExch(g_found, 1) == 0){
  // I want this only accessed by one thread, if possible the one with the lowest id
  memcpy(h_someVar, someVar, x);
}

Dear @Robert_Crovella

Can you help me with this kind of problem, but using Python?
I mean is there any kind of example SHA256 Calculation using Python in implementation with CUDA?

Thank you very much for your kind attention.

I don’t know of any examples. They may exist somewhere. The common CUDA-python language bindings I am familiar with are pycuda, numba, and cupy. It should be possible to write a sha256 calculation using any of those. For example, if you had an already-written CUDA C++ kernel (such as the one on github linked in this thread - sha256_kernel) you can use that directly in pycuda. The process is straightforward, you use the kernel code directly, and the conversion of host code from C++ to python suitable for pycuda is basically mechanical. I won’t be able to give a tutorial on how to use pycuda here. That info is available elsewhere, though.

Thank you very much for your quick response.
I will be trying hard to figure this out.
Will try with pycuda or numba or cupy.