AVX2 code conversion to CUDA

How to modify C++ AVX2 program for CUDA?

__m256i indata[4];
__m256i key;
key = _mm256_setr_epi32(0x41686f6a, 0x31323334, 0, 0, 0, 0, 0, 0); 
indata[0] = _mm256_xor_si256(key, _mm256_set1_epi8(0x36));
indata[1] = _mm256_set1_epi8(0x36);   // 0 XOR 0x36 = 0x36
indata[2] = _mm256_setr_epi32(0x504d4b20, 0x4e616d65, 0xc0c1c04b, 0xfc68fcc2, 0x330136c6, 0x80000000, 0, 0);
indata[3] = _mm256_setr_epi32(0, 0, 0, 0, 0, 0, 0, (64 + 20) * 8);

I am unsure whether I should use CUDA vector types, e.g. uint1 (4 bytes), or a plain array, i.e. a uint32_t[8], as the equivalent of __m256i.

Most of the code you’re showing initializes indata with constants. The only actual arithmetic operation is the computation of indata[0], which XORs each of the eight 32-bit lanes with 0x36363636. But I assume the actual code is longer than what you are showing.

AVX2 registers have more data lanes than the CUDA built-in vector types int4/uint4 provide, and there is currently no built-in 256-bit CUDA vector type. Working with structs or unions that contain a uint32_t[8], or alternatively a uint4[2], might work.

You could write some AVX2 emulation routines that operate on said structs, so you can keep the business logic of the code unchanged.
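
For illustration, a minimal sketch of what such an emulation layer might look like, based on a struct holding a uint32_t[8] (the names m256i_t, xor_si256, set1_epi8 and setr_epi32 are placeholders of my own, not any existing CUDA API):

#include <cstdint>

// Hypothetical stand-in for __m256i: eight 32-bit lanes in one struct.
struct m256i_t {
    uint32_t v[8];
};

// Emulation of _mm256_xor_si256: lane-wise XOR.
__host__ __device__ inline m256i_t xor_si256(const m256i_t& a, const m256i_t& b){
    m256i_t r;
    for(int i = 0; i < 8; i++){ r.v[i] = a.v[i] ^ b.v[i]; }
    return r;
}

// Emulation of _mm256_set1_epi8: replicate one byte into every byte of the register.
__host__ __device__ inline m256i_t set1_epi8(uint8_t x){
    m256i_t r;
    for(int i = 0; i < 8; i++){ r.v[i] = 0x01010101u * x; }
    return r;
}

// Emulation of _mm256_setr_epi32: the first argument ends up in lane 0.
__host__ __device__ inline m256i_t setr_epi32(uint32_t e0, uint32_t e1, uint32_t e2, uint32_t e3,
                                              uint32_t e4, uint32_t e5, uint32_t e6, uint32_t e7){
    return m256i_t{{e0, e1, e2, e3, e4, e5, e6, e7}};
}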

The CUDA samples contain a header file called helper_math.h which provides some arithmetic operator overloads for the existing CUDA vector types such as uint4. Unfortunately no bit operations like XOR are present, but based on the existing code you could easily implement your own version of the ^ operator.
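
For example, a uint4 XOR overload in the spirit of helper_math.h could look like this (my own sketch, not part of the shipped header):

// Component-wise XOR for uint4, modeled after the arithmetic overloads in helper_math.h.
inline __host__ __device__ uint4 operator^(uint4 a, uint4 b){
    return make_uint4(a.x ^ b.x, a.y ^ b.y, a.z ^ b.z, a.w ^ b.w);
}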

If you are unsure what the individual AVX2 intrinsics do, refer to the excellent Intel Intrinsics Guide. Pseudocode is given for almost every intrinsic, illustrating what it does.

My suggestion would be to use one CUDA thread per AVX lane (func below). The only ugly part is converting the “set” intrinsics without using local memory (func2 below).

#include <cooperative_groups.h>

namespace cg = cooperative_groups;

__device__
unsigned int func(){

    // Each group of 8 threads emulates one 256-bit register,
    // with every thread holding a single 32-bit lane.
    auto group = cg::tiled_partition<8>(cg::this_thread_block());

    unsigned int indata[4];
    unsigned int key = 0;

    // _mm256_setr_epi32 places its first argument in lane 0,
    // which with this mapping is held by thread rank 7.
    {
        constexpr unsigned int values[8]{0x41686f6a, 0x31323334, 0, 0, 0, 0, 0, 0};
        key = values[8-1-group.thread_rank()];
    }

    // _mm256_xor_si256 with _mm256_set1_epi8(0x36) is a per-lane XOR with 0x36363636
    indata[0] = key ^ 0x36363636;
    indata[1] = 0x36363636;

    {
        constexpr unsigned int values[8]{0x504d4b20, 0x4e616d65, 0xc0c1c04b, 0xfc68fcc2, 0x330136c6, 0x80000000, 0, 0};
        indata[2] = values[8-1-group.thread_rank()];
    }

    {
        constexpr unsigned int values[8]{0, 0, 0, 0, 0, 0, 0, (64 + 20) * 8};
        indata[3] = values[8-1-group.thread_rank()];
    }

    return indata[0] + indata[1] + indata[2] + indata[3];
}

__device__
unsigned int func2(){

    auto group = cg::tiled_partition<8>(cg::this_thread_block());

    unsigned int indata[4];
    unsigned int key = 0;

    // Thread rank 7 holds lane 0 of the emulated register, rank 6 holds lane 1.
    if(group.thread_rank() == 7){ key = 0x41686f6a; }
    else if(group.thread_rank() == 6){ key = 0x31323334; }

    indata[0] = key ^ 0x36363636;
    indata[1] = 0x36363636;

    {
        // The fully unrolled loop over compile-time constants lets the compiler
        // generate a select chain instead of an indexed array in local memory.
        constexpr unsigned int values[8]{0x504d4b20, 0x4e616d65, 0xc0c1c04b, 0xfc68fcc2, 0x330136c6, 0x80000000, 0, 0};
        #pragma unroll
        for(int i = 0; i < 8; i++){
            if(i == 8-1-group.thread_rank()){
                indata[2] = values[i];
                break;
            }
        }
    }

    // The bit-length word (64 + 20) * 8 sits in the highest lane, held by thread rank 0.
    if(group.thread_rank() == 0){
        indata[3] = (64 + 20) * 8;
    }else{
        indata[3] = 0;
    }

    return indata[0] + indata[1] + indata[2] + indata[3];
}
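
For completeness, a rough sketch of how such a __device__ function might be driven from a kernel (the kernel and buffer names here are made up; the block size must be a multiple of 8 so that tiled_partition<8> forms complete groups):

// Each group of 8 consecutive threads emulates one 256-bit register,
// so a block of 256 threads works on 32 independent "registers" in parallel.
__global__ void kernel(unsigned int* out){
    unsigned int result = func();   // or func2()
    out[blockIdx.x * blockDim.x + threadIdx.x] = result;
}

// launch example:
// kernel<<<numBlocks, 256>>>(d_out);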

My guess is that the above code is part of a cryptographic hash function, maybe cryptocurrency related. One does not typically SIMD-optimize a cryptographic hash unless one needs to run many such hashes in parallel. The code may be intended to brute-force the hash (e.g. for mining or password recovery) by running many hashing operations simultaneously. In that case it should not really matter whether one CUDA thread per AVX2 lane is used or whether one CUDA thread processes the equivalent of all 8 lanes. There will be a difference in per-thread register use and occupancy; only benchmarking will show which alternative is faster.

Update: is the code part of a SHA1 hash? See HMAC, SHA1 (C++/AVX2 VS2022). In that case it might be easier to base the CUDA version on a non-SIMD-optimized version of said hash function.

This. Generally speaking, the point of CUDA is to write code in per-thread scalar fashion and let the SIMT hardware worry about parallel execution. Introducing notions of explicit SIMD-ness into CUDA programs is usually not a good idea, with some exceptions for data types smaller than 32 bits.

As @cbuchner1 says, it is therefore usually most appropriate to start the design effort for a CUDA port from an existing scalar implementation rather than a version optimized for some existing fixed-width explicit-SIMD architecture.
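
To make that concrete, here is a hedged sketch of the snippet from the question rewritten as plain per-thread scalar code (prepare_block is a made-up helper name; the constants and the layout as 32 consecutive 32-bit words are taken from the question):

#include <cstdint>

// One CUDA thread fills the whole 128-byte buffer in ordinary scalar code;
// parallelism comes from launching many threads, each on its own candidate.
__device__ void prepare_block(uint32_t w[32]){
    // first 64 bytes: the key XORed with 0x36 bytes (HMAC ipad style)
    w[0] = 0x41686f6a ^ 0x36363636;
    w[1] = 0x31323334 ^ 0x36363636;
    for(int i = 2; i < 16; i++){ w[i] = 0x36363636; }   // 0 XOR 0x36

    // second 64 bytes: message words, SHA-1 padding bit and bit length
    const uint32_t msg[8] = {0x504d4b20, 0x4e616d65, 0xc0c1c04b, 0xfc68fcc2,
                             0x330136c6, 0x80000000u, 0, 0};
    for(int i = 0; i < 8; i++){ w[16 + i] = msg[i]; }
    for(int i = 24; i < 31; i++){ w[i] = 0; }
    w[31] = (64 + 20) * 8;                               // total length in bits
}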

If this is from a SHA1 implementation, it would probably be a good idea to search the internet for existing CUDA implementations, of which there should be plenty. I am pretty sure I came across CUDA-based “SHA1-crackers” more than a decade ago, and the introduction of the LOP3 instruction should have given a significant boost to such codes.