I am a newbie with using decuda and I tried disassembling a simple CUDA kernel as shown below:
// Kernel that executes on the CUDA device
global void twice_array(float a, int N)
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N) a[idx] = 2a[idx];
The cubin disassembled output is as follows:
// Disassembling _Z11twice_arrayPfi
000000: 10000205 40004780 mov.u16 $r0.hi, %ntid.y
000008: a0000005 04000780 cvt.rn.u32.u16 $r1, $r0.lo
000010: 60014c01 00204780 mad24.lo.u32.u16.u16.u32 $r0, s[0x000c], $r0.hi, $r1
000018: 3000cdfd 6c20c7c8 set.le.s32 $p0|$o127, s[0x0018], $r0
000020: 30000003 00000280 @$p0.ne return
000028: 30020001 c4100780 shl.u32 $r0, $r0, 0x00000002
000030: 2000c805 04200780 add.u32 $r1, s[0x0010], $r0
000038: d00e0201 80c00780 mov.u32 $r0, g[$r1]
000040: b0000001 00000780 add.rn.f32 $r0, $r0, $r0
000048: d00e0201 a0c00781 mov.end.u32 g[$r1], $r0
Now, the second last instruction uses an f32 data type. I am wondering if this is an error because both the source operands are u32 type. AFIAK, the value loaded from the global memory should be going into a floating register. Or is it that decuda just uses the same register name for all data type-floating and unsigned? The destination data type is floating though.
Your help is greatly appreciated.
Also, I know that the first instruction loads the block dimension blockDim.x. But why is it not represented as a register or a memory location as opposed to some variable name?
Registers are just registers, there are no specialized registers for floating point or integer on the hardware level. Don’t get misled by the fact that decuda tries to follow PTX syntax as closely as possible. Decuda output is not PTX. It is all based on reverse engineering of the hardware.
Decuda is kind of outdated as well. Because of decuda’s existence, Nvidia has made their own disassembler publicly available. However it disassembles to its own syntax which is not documented.
Thanks tera for your response.
Yes, I am aware that Decuda disassembles to SASS and not PTX. What are your thoughts on the other question with the %ntid.y parameter? Mind you, s[0x00c] represents blockIdx.x. So at the least it should have been %ntid.x or in a better sense, the register or memory location where it is stored rather than some variable.
Anyone on the %ntid issue please?
Seems so. cuobjdump outputs [font=“Courier New”]MOV.U16 R0H, g [0x1].U16;[/font] which is ntid.x.
Thank you Tera. I appreciate your response.
I am also wondering on one more issue with the assembly generated. The set.le.s32 $p0|$o127, s[0x0018], $r0 instruction is basically doing the comparison of idx and N. According to the instruction, if N<idx, the predicate $p0 is set to 1. However the next predicate instruction @$p0.ne return disables all threads that have the predicate set to 0, which actually correspond to threads with their idx < N!
Shouldn’t it be @$p0.eq return, so that all threads with idx > N are disabled?