AVX2 code conversion to CUDA

How to modify C++ AVX2 program for CUDA?

__m256i indata[4];
__m256i key;
key = _mm256_setr_epi32(0x41686f6a, 0x31323334, 0, 0, 0, 0, 0, 0); 
indata[0] = _mm256_xor_si256(key, _mm256_set1_epi8(0x36));
indata[1] = _mm256_set1_epi8(0x36);   // 0 XOR 0x36 = 0x36
indata[2] = _mm256_setr_epi32(0x504d4b20, 0x4e616d65, 0xc0c1c04b, 0xfc68fcc2, 0x330136c6, 0x80000000, 0, 0);
indata[3] = _mm256_setr_epi32(0, 0, 0, 0, 0, 0, 0, (64 + 20) * 8);

I am unsure whether I should use CUDA vector types, e.g. uint1 (4 bytes), or a plain array, i.e. a uint32_t[8], as the equivalent of __m256i.

Most of the code you’re showing initializes indata with constants. The only actual arithmetic operation is the computation of indata[0], which XORs each of the eight 32-bit lanes with 0x36363636. But I assume the actual code is longer than what you are showing.

AVX2 registers have more data lanes than the CUDA built-in vector types int4/uint4 provide, and there is currently no built-in 256-bit CUDA vector type. Working with structs or unions that contain a uint32_t[8], or alternatively a uint4[2], might work.

You could write some AVX2 emulation routines that operate on said structs, so you can keep the business logic of the code unchanged.
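
For illustration, a minimal sketch of what such an emulation layer might look like, based on a struct holding a uint32_t[8] (the names m256i_t, xor_si256, set1_epi8 and setr_epi32 are placeholders of my own, not any existing CUDA API):

#include <cstdint>

// Hypothetical stand-in for __m256i: eight 32-bit lanes in one struct.
struct m256i_t {
    uint32_t v[8];
};

// Emulation of _mm256_xor_si256: lane-wise XOR.
__host__ __device__ inline m256i_t xor_si256(const m256i_t& a, const m256i_t& b){
    m256i_t r;
    for(int i = 0; i < 8; i++){ r.v[i] = a.v[i] ^ b.v[i]; }
    return r;
}

// Emulation of _mm256_set1_epi8: replicate one byte into every byte of the register.
__host__ __device__ inline m256i_t set1_epi8(uint8_t x){
    m256i_t r;
    for(int i = 0; i < 8; i++){ r.v[i] = 0x01010101u * x; }
    return r;
}

// Emulation of _mm256_setr_epi32: the first argument ends up in lane 0.
__host__ __device__ inline m256i_t setr_epi32(uint32_t e0, uint32_t e1, uint32_t e2, uint32_t e3,
                                              uint32_t e4, uint32_t e5, uint32_t e6, uint32_t e7){
    return m256i_t{{e0, e1, e2, e3, e4, e5, e6, e7}};
}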

The CUDA samples contain a header file called helper_math.h which provides some arithmetic operator overloads for the existing CUDA vector types such as uint4. Unfortunately no bit operations like XOR are present, but based on the existing code you could easily implement your own version of the ^ operator.
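
For example, a uint4 XOR overload in the spirit of helper_math.h could look like this (my own sketch, not part of the shipped header):

// Component-wise XOR for uint4, modeled after the arithmetic overloads in helper_math.h.
inline __host__ __device__ uint4 operator^(uint4 a, uint4 b){
    return make_uint4(a.x ^ b.x, a.y ^ b.y, a.z ^ b.z, a.w ^ b.w);
}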

If you are unsure what the individual AVX2 intrinsics do, refer to the excellent Intel Intrinsics Guide. Pseudocode is given for almost every intrinsic, illustrating what it does.

My suggestion would be to use one CUDA thread per AVX lane (func below). The only ugly part is converting the “set” intrinsics without using local memory (func2 below).

#include <cooperative_groups.h>

namespace cg = cooperative_groups;

__device__
unsigned int func(){

    // Each group of 8 threads emulates one 256-bit register,
    // with every thread holding a single 32-bit lane.
    auto group = cg::tiled_partition<8>(cg::this_thread_block());

    unsigned int indata[4];
    unsigned int key = 0;

    // _mm256_setr_epi32 places its first argument in lane 0,
    // which with this mapping is held by thread rank 7.
    {
        constexpr unsigned int values[8]{0x41686f6a, 0x31323334, 0, 0, 0, 0, 0, 0};
        key = values[8-1-group.thread_rank()];
    }

    // _mm256_xor_si256 with _mm256_set1_epi8(0x36) is a per-lane XOR with 0x36363636
    indata[0] = key ^ 0x36363636;
    indata[1] = 0x36363636;

    {
        constexpr unsigned int values[8]{0x504d4b20, 0x4e616d65, 0xc0c1c04b, 0xfc68fcc2, 0x330136c6, 0x80000000, 0, 0};
        indata[2] = values[8-1-group.thread_rank()];
    }

    {
        constexpr unsigned int values[8]{0, 0, 0, 0, 0, 0, 0, (64 + 20) * 8};
        indata[3] = values[8-1-group.thread_rank()];
    }

    return indata[0] + indata[1] + indata[2] + indata[3];
}

__device__
unsigned int func2(){

    auto group = cg::tiled_partition<8>(cg::this_thread_block());

    unsigned int indata[4];
    unsigned int key = 0;

    // Thread rank 7 holds lane 0 of the emulated register, rank 6 holds lane 1.
    if(group.thread_rank() == 7){ key = 0x41686f6a; }
    else if(group.thread_rank() == 6){ key = 0x31323334; }

    indata[0] = key ^ 0x36363636;
    indata[1] = 0x36363636;

    {
        // The fully unrolled loop over compile-time constants lets the compiler
        // generate a select chain instead of an indexed array in local memory.
        constexpr unsigned int values[8]{0x504d4b20, 0x4e616d65, 0xc0c1c04b, 0xfc68fcc2, 0x330136c6, 0x80000000, 0, 0};
        #pragma unroll
        for(int i = 0; i < 8; i++){
            if(i == 8-1-group.thread_rank()){
                indata[2] = values[i];
                break;
            }
        }
    }

    // The bit-length word (64 + 20) * 8 sits in the highest lane, held by thread rank 0.
    if(group.thread_rank() == 0){
        indata[3] = (64 + 20) * 8;
    }else{
        indata[3] = 0;
    }

    return indata[0] + indata[1] + indata[2] + indata[3];
}
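
For completeness, a rough sketch of how such a __device__ function might be driven from a kernel (the kernel and buffer names here are made up; the block size must be a multiple of 8 so that tiled_partition<8> forms complete groups):

// Each group of 8 consecutive threads emulates one 256-bit register,
// so a block of 256 threads works on 32 independent "registers" in parallel.
__global__ void kernel(unsigned int* out){
    unsigned int result = func();   // or func2()
    out[blockIdx.x * blockDim.x + threadIdx.x] = result;
}

// launch example:
// kernel<<<numBlocks, 256>>>(d_out);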

My guess is that the above code is part of a cryptographic hash function, maybe cryptocurrency related. One does not typically SIMD-optimize a cryptographic hash unless one needs to run many such hashes in parallel. The code may be intended to brute-force the hash (e.g. for mining or password recovery) by running many hashing operations simultaneously. In that case it should not really matter whether one CUDA thread per AVX2 lane is used or whether one CUDA thread processes the equivalent of all 8 lanes. There will be a difference in per-thread register use and occupancy; only benchmarking will show which alternative is faster.

Update: is the code part of a SHA1 hash? See HMAC, SHA1 (C++/AVX2 VS2022). In that case it might be easier to base the CUDA version on a non-SIMD-optimized version of said hash function.

This. Generally speaking, the point of CUDA is to write code in per-thread scalar fashion and let the SIMT hardware worry about parallel execution. Introducing notions of explicit SIMD-ness into CUDA programs is usually not a good idea, with some exceptions for data types smaller than 32 bits.

As @cbuchner1 says, it is therefore usually most appropriate to start the design effort for a CUDA port from an existing scalar implementation rather than a version optimized for some existing fixed-width explicit-SIMD architecture.
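
To make that concrete, here is a hedged sketch of the snippet from the question rewritten as plain per-thread scalar code (prepare_block is a made-up helper name; the constants and the layout as 32 consecutive 32-bit words are taken from the question):

#include <cstdint>

// One CUDA thread fills the whole 128-byte buffer in ordinary scalar code;
// parallelism comes from launching many threads, each on its own candidate.
__device__ void prepare_block(uint32_t w[32]){
    // first 64 bytes: the key XORed with 0x36 bytes (HMAC ipad style)
    w[0] = 0x41686f6a ^ 0x36363636;
    w[1] = 0x31323334 ^ 0x36363636;
    for(int i = 2; i < 16; i++){ w[i] = 0x36363636; }   // 0 XOR 0x36

    // second 64 bytes: message words, SHA-1 padding bit and bit length
    const uint32_t msg[8] = {0x504d4b20, 0x4e616d65, 0xc0c1c04b, 0xfc68fcc2,
                             0x330136c6, 0x80000000u, 0, 0};
    for(int i = 0; i < 8; i++){ w[16 + i] = msg[i]; }
    for(int i = 24; i < 31; i++){ w[i] = 0; }
    w[31] = (64 + 20) * 8;                               // total length in bits
}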

If this is from a SHA1 implementation, it would probably be a good idea to search the internet for existing CUDA implementations, of which there should be plenty. I am pretty sure I came across CUDA-based “SHA1-crackers” more than a decade ago, and the introduction of the LOP3 instruction should have given a significant boost to such codes.