I was thinking of a tool with which I could get an approximate view of the CUDA C++ code corresponding to a given piece of PTX code.
Something similar is shown in the following blog post by Fei Kong, which discusses reverse engineering PTX back to CUDA.
Given a PTX:
//
// Generated by LLVM NVPTX Back-End
//
// PTX ISA 8.4, targeting sm_75 (Turing), 64-bit address space.
.version 8.4
.target sm_75
.address_size 64
// .globl kernel
// Entry point: kernel(ptr param_0, ptr param_1, u32 param_2).
// .maxntid pins the launch to at most 128 threads per block in x.
.visible .entry kernel(
.param .u64 kernel_param_0,
.param .u64 kernel_param_1,
.param .u32 kernel_param_2
)
.maxntid 128, 1, 1
{
// Virtual register declarations: predicates, 32-bit, 64-bit.
.reg .pred %p<3>;
.reg .b32 %r<9>;
.reg .b64 %rd<6>;
.loc 1 9 0
$L__func_begin0:
.loc 1 9 0
// %rd3 = param_0 (source pointer), %rd4 = param_1 (destination pointer).
ld.param.u64 %rd3, [kernel_param_0];
ld.param.u64 %rd4, [kernel_param_1];
$L__tmp0:
.loc 1 16 24
// begin inline asm
mov.u32 %r1, %ctaid.x;              // %r1 = blockIdx.x
// end inline asm
.loc 1 17 20
shl.b32 %r4, %r1, 7;                // %r4 = blockIdx.x << 7 (x * 128)
ld.param.u32 %r5, [kernel_param_2]; // %r5 = param_2 (element count bound)
.loc 1 17 48
mov.u32 %r6, %tid.x;                // %r6 = threadIdx.x
and.b32 %r7, %r6, 127;              // %r7 = threadIdx.x & 127
.loc 1 17 35
// OR acts as ADD here: the low 7 bits of %r4 are zero after the shift.
or.b32 %r8, %r4, %r7;               // %r8 = global element index
.loc 1 18 21
// Signed bounds check: %p1 = (index < param_2).
setp.lt.s32 %p1, %r8, %r5;
.loc 1 19 24
// Byte offset for 4-byte elements, with signed widening to 64 bits.
mul.wide.s32 %rd5, %r8, 4;
add.s64 %rd1, %rd3, %rd5;           // %rd1 = param_0 + index*4
.loc 1 19 16
// begin inline asm
mov.u32 %r3, 0x0;                   // default value when the load is masked off
@%p1 ld.global.b32 { %r3 }, [ %rd1 + 0 ];   // predicated 32-bit load
// end inline asm
.loc 1 20 21
add.s64 %rd2, %rd4, %rd5;           // %rd2 = param_1 + index*4
.loc 1 20 30
// begin inline asm
@%p1 st.global.b32 [ %rd2 + 0 ], { %r3 };   // predicated 32-bit store
// end inline asm
.loc 1 20 4
ret;
$L__tmp1:
$L__func_end0:
}
.file 1 "/tmp/ipykernel_2092734/781878086.py"
// DWARF abbreviation table: a single compile-unit abbreviation
// (tag 17 = DW_TAG_compile_unit) with producer/name/line/comp_dir/
// low_pc/high_pc attributes.
.section .debug_abbrev
{
.b8 1
.b8 17
.b8 0
.b8 37
.b8 8
.b8 19
.b8 5
.b8 3
.b8 8
.b8 16
.b8 6
.b8 27
.b8 8
.b8 17
.b8 1
.b8 18
.b8 1
.b8 0
.b8 0
.b8 0
}
// DWARF compile-unit info. The raw .b8 byte runs below are ASCII strings:
// producer = "triton", source name = "781878086.py",
// comp_dir = "/tmp/ipykernel_2092734".
.section .debug_info
{
.b32 73
.b8 2
.b8 0
.b32 .debug_abbrev
.b8 8
.b8 1
.b8 116        // 't'
.b8 114        // 'r'
.b8 105        // 'i'
.b8 116        // 't'
.b8 111        // 'o'
.b8 110        // 'n'  -> producer "triton"
.b8 0
.b8 2
.b8 0
.b8 55         // '7'
.b8 56         // '8'
.b8 49         // '1'
.b8 56         // '8'
.b8 55         // '7'
.b8 56         // '8'
.b8 48         // '0'
.b8 56         // '8'
.b8 54         // '6'
.b8 46         // '.'
.b8 112        // 'p'
.b8 121        // 'y'  -> name "781878086.py"
.b8 0
.b32 .debug_line
.b8 47         // '/'
.b8 116        // 't'
.b8 109        // 'm'
.b8 112        // 'p'
.b8 47         // '/'
.b8 105        // 'i'
.b8 112        // 'p'
.b8 121        // 'y'
.b8 107        // 'k'
.b8 101        // 'e'
.b8 114        // 'r'
.b8 110        // 'n'
.b8 101        // 'e'
.b8 108        // 'l'
.b8 95         // '_'
.b8 50         // '2'
.b8 48         // '0'
.b8 57         // '9'
.b8 50         // '2'
.b8 55         // '7'
.b8 51         // '3'
.b8 52         // '4'  -> comp_dir "/tmp/ipykernel_2092734"
.b8 0
.b64 $L__func_begin0
.b64 $L__func_end0
}
.section .debug_loc { }
I would like to obtain a rough CUDA C++ reconstruction like the following (for understanding):
// Approximate CUDA C++ reconstruction of the PTX above.
// Copies input[i] -> output[i] for indices below N, with each block
// covering a contiguous chunk of 128 elements (matches .maxntid 128, 1, 1).
__global__ void kernel(float *input, float *output, int N) {
    // Block and thread indices (PTX: %ctaid.x and %tid.x).
    int blockId = blockIdx.x;
    int threadId = threadIdx.x;

    // Global element index: blockIdx.x * 128 + (threadIdx.x & 127).
    // The PTX builds this as (blockId << 7) | (threadId & 127); the OR is
    // equivalent to addition because the low 7 bits of the shifted value
    // are zero.
    int globalIndex = (blockId << 7) | (threadId & 127);

    // Bounds check. The index is kept SIGNED to match the PTX, which uses
    // setp.lt.s32 (signed compare) and mul.wide.s32 (signed widening
    // multiply). With an `unsigned int` index, `globalIndex < N` would
    // implicitly convert N to unsigned and diverge from the PTX for N < 0.
    if (globalIndex < N) {
        // Predicated ld.global.b32 / st.global.b32 pair in the PTX.
        float value = input[globalIndex];
        output[globalIndex] = value;
    }
}
The above reverse engineering was done by ChatGPT.