Slow performance when calculating SHA-256

Calculating a SHA-256 hash of 10 MB of data on a GTX 1070 takes just under 1 minute to complete.
Is this normal? I mean, on the CPU (using the sha256sum command) I get the result almost instantly for the same 10 MB input file.

I understand that the SHA-256 algorithm cannot be parallelized, only the number of concurrent threads can be increased. I’ve also tested that running the program on the GPU with two 10 MB files takes the same time as running it with one file.

What I am not sure about is the very big time difference (almost 1 minute).

If interested, here is my code, nvprof output and specs. Oh, and I’m running Ubuntu 16.04 x64 with the latest CUDA.
https://github.com/Horkyze/CudaSHA256

==22686== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  57.8597s         1  57.8597s  57.8597s  57.8597s  sha256_cuda(JOB**, int)
                    0.00%     736ns         1     736ns     736ns     736ns  [CUDA memcpy HtoD]
      API calls:   99.54%  57.8597s         1  57.8597s  57.8597s  57.8597s  cudaDeviceSynchronize
                    0.31%  180.73ms        10  18.073ms  45.257us  179.20ms  cudaMallocManaged
                    0.15%  87.775ms         1  87.775ms  87.775ms  87.775ms  cudaDeviceReset
                    0.00%  428.18us        94  4.5550us     628ns  161.54us  cuDeviceGetAttribute
                    0.00%  118.66us         1  118.66us  118.66us  118.66us  cuDeviceTotalMem
                    0.00%  86.322us         1  86.322us  86.322us  86.322us  cudaLaunch
                    0.00%  50.983us         1  50.983us  50.983us  50.983us  cudaMemcpyToSymbol
                    0.00%  40.856us         1  40.856us  40.856us  40.856us  cuDeviceGetName
                    0.00%  34.007us        22  1.5450us     768ns  5.0980us  cudaGetLastError
                    0.00%  5.5870us         2  2.7930us     838ns  4.7490us  cudaSetupArgument
                    0.00%  3.8410us         3  1.2800us     768ns  2.0250us  cuDeviceGetCount
                    0.00%  2.0950us         2  1.0470us     908ns  1.1870us  cuDeviceGet
                    0.00%  1.8160us         1  1.8160us  1.8160us  1.8160us  cudaConfigureCall

My specs

./deviceQuery/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1070"
  CUDA Driver Version / Runtime Version          9.0 / 9.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 8114 MBytes (8507752448 bytes)
  (15) Multiprocessors, (128) CUDA Cores/MP:     1920 CUDA Cores
  GPU Max Clock rate:                            1785 MHz (1.78 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS

You’ll want to compute thousands of SHA256 rounds in parallel. This is going to be fast.

If you’re working on a single 10MB data stream where the input state of each message block depends on the previous block’s output, there’s really nothing a GPU can do to speed up your processing. This is inherently sequential.

If your test case consisted of running the entire 10MB hash single threaded on one thread block in a grid consisting of one block, then congratulations: you’ve used 1/1920th of your GPU’s processing resources. Does this explain the runtime difference to you?

I understand that hashing one block is a ridiculous use case for a GPU.

I’m just wondering why it takes so long – still new to CUDA, so I don’t yet know the performance baseline.
However, my feeling is that 1 minute for hashing a 10 MB file is just too long, and the problem may be in my algorithm implementation.

Or if you guys on this forum say that 1 minute is perfectly OK for such a task, I can accept that :)

It may very well be that your implementation runs slowly on the GPU. But without you posting a complete, compilable piece of code there’s no way for us to tell. ;)

But you’re talking to the right person here. I’ve done a fair share of hash algorithms on GPU and optimized them for speed.

cbuchner1 <-- cudaminer, ccminer developer (until early 2015)

Good to hear that, I was actually inspired by cudaminer, well written :)

I’ve made some cosmetic changes to the code and repo - should be more clear now:
https://github.com/Horkyze/CudaSHA256

In the meantime I will try to rewrite the code for CPU only (no CUDA) and see if there is any increase in speed. I think it should hash as fast as the sha256sum utility – if it does not, then the problem is the algorithm implementation, which is from here: https://github.com/B-Con/crypto-algorithms/blob/master/sha256.c

From quickly looking at your code I see that you have plenty of operations that operate on arrays of words or bytes.

CUDA generally puts local array variables into local memory (which is like global memory in terms of access speed and latency). There are a lot of for loops in your code that iterate over such array elements. The only way to get your arrays to reside in registers is to fully unroll the for loops that iterate over your arrays. Registers will be an order of magnitude faster than local memory (depending a bit on how effectively the L1 cache handles your specific data access patterns).

It is also problematic to use byte arrays (size 64) for your message blocks. If possible declare such arrays in the native register width of CUDA (32 bits, i.e. uint32_t) even if it makes appending single message bytes more cumbersome to perform in code.

In cudaminer/ccminer we’d generally know the length of the data to hash at compilation time. Your code however is written to operate on streams of arbitrary length. It might be useful to have a fully optimized and unrolled sha256 round function for dealing with complete 64 byte message blocks and then have one not unrolled or optimized sha256 round function that deals with the few remaining bytes of the stream (plus any required padding).

Also, there’s a pretty solid reason why a GPU-based shasum utility might not be much faster than a CPU-based one when operating on short (megabytes) streams: the required PCIe transfer of the data to the GPU might be slower than just hashing it locally on the CPU (which may already hold most of the stream in its caches!). And then there’s also the issue of not being able to parallelize a single-stream hash computation due to data dependencies.

BTW cudaminer is not so well written. It’s written to get the job done and not to look pretty ;-)

Thanks for the response, I find it very informative :)

> CUDA generally puts local array variables into local memory (which is like global memory in terms of access speed and latency).

By local memory, do you mean the “RAM” of the GPU, and by global memory the actual RAM that the CPU uses?
For example, consider these C calls:

void *p;
p = malloc(10);            // this allocates global memory - RAM
cudaMalloc(&p, 10);        // this allocates local (device) memory
cudaMallocManaged(&p, 10); // this allocates unified memory, which is where? (on the device or in RAM?)

Is there any speed difference when accessing data allocated with cudaMalloc and cudaMallocManaged?

Also, can it be argued that given my CPU frequency = 3.7 GHz and GPU frequency = 1.78 GHz, the same (single-threaded) code will run approximately 2x faster on the CPU, neglecting the time for data transfer (host to device and back)?

Global memory, local memory, shared memory, host memory, device memory are all well established CUDA terminology with very specific definitions. Please refer to the CUDA programming guide for the definitions.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

The best practices guide also has more information about these memory spaces on the GPU (in particular check the subsection about local memory)

http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#device-memory-spaces

You cannot really compare per-thread performance of a CPU and a GPU based on the clock speed ratio alone. Individual instruction throughput plays a role, as do memory latencies and bandwidth, and the overall utilization of the compute cores…

@matej.bellus Your code contains some errors.

In your JOB struct (same in JOB_init), you declared a 64-byte digest, which should be 32 bytes (256 bits).

Also, in your hash_to_string function, you do a malloc(70) (why not 65?) but you never free the memory.