What does LOP.AND.NZ do?

I have a couple questions regarding SASS code.

LOP.AND.NZ P0, RZ, R5, 0x2

1/ AFAIK, LOP is a logic operation, AND is &, and NZ is non-zero. But I’m not sure what it does when the code combines all three in one instruction. I also have no idea what the number 0x2 at the end of the instruction means.

2/ How are the constant banks laid out? In my experience, c[0x0][0x08] is blockDim, and c[0x0][0x20] is maybe the kernel address(?). Is there anything else that people have discovered?

What architecture is this from? I am guessing Maxwell (sm_50).

(1) It would be easier to determine by looking at the source code. The 2 is an immediate operand, so presumably your source contains something like var & 2 or <expr> & 2. The function of the .NZ suffix is unclear; I don’t think NVIDIA has documented the machine instructions to that level of detail. Apparently there are also .Z and .T variants, suggesting that this pertains to the predicate. Without additional context I am guessing, but I think RZ is the destination register here, so the result of the AND actually goes to the bit bucket, and the actual output of interest is the predicate P0, which would make this a TEST/TST/ANDcc kind of instruction:

p0 = (R5 & 2) != 0
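
For illustration, source along these lines would plausibly compile to such a predicate-setting AND (a sketch; the kernel and variable names are made up):

// Hypothetical example: a bit test like this is the kind of source
// construct that typically compiles to an AND that only sets a
// predicate, with the integer result discarded into RZ.
__global__ void bit_test(int *out, const int *in)
{
    int v = in[threadIdx.x];
    if (v & 2)                  // predicate: P0 = (v & 2) != 0
        out[threadIdx.x] = 1;
}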

(2) Historically, constant memory bank usage has changed multiple times between GPU architectures, so you would need to be more specific as to what architecture you are looking at. You should be able to reverse engineer most of it by looking at kernels of increasing complexity.

(1) Yes, it’s sm_52. Your answer makes it clearer to me. I’m also looking at a closed-source kernel, so I’m still guessing around.
(2) Wow, I thought it was “constant” like the name suggests and unchanged between GPU architectures. But thanks, I can do some reverse engineering to figure it out.

You would want to hone your proficiency at decoding the relevant machine language before tackling unknown kernels for which you don’t have the source. The textual decoding provided by NVIDIA’s disassembler is very much machine-centric, which doesn’t help with human comprehension. So you get things like IMAD.ADD, which is functionally an IADD, or LOP.PASS_THROUGH, which is functionally a MOV.

The reason NVIDIA won’t commit to binary compatibility of the ISA is precisely because they want the freedom of changing things around from architecture to architecture, presumably based on their latest data on common usage patterns, code performance implications, and implementation efficiency. Occasionally some feature stays static for a couple of major architecture generations (e.g. XMAD as integer multiply building block), only to be changed again later on.

The basic idea behind “constant memory” hasn’t changed, though: it is intended for data that stays constant across a kernel invocation and access to which is (mostly) uniform across the threads in a warp. Its presence is a hard requirement for graphics, as I recall, so compute just inherited it.
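
As a small illustration (a sketch, not from any particular kernel): user-declared __constant__ data is exactly this kind of per-launch-constant, warp-uniform data, and accesses to it show up as c[bank][offset] operands in the SASS (which bank it lands in varies by architecture):

// A minimal sketch: `coeffs` stays constant across the launch and is
// read uniformly by all threads in a warp, so reads of it appear in
// the SASS as c[bank][offset] operands rather than memory loads.
__constant__ float coeffs[4];

__global__ void scale(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = coeffs[0] * in[i] + coeffs[1];  // warp-uniform constant reads
}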

For sm_52, I am seeing the following:

c[0x0][0x8] is blockDim.x
c[0x0][0xc] is blockDim.y
c[0x0][0x10] is blockDim.z
c[0x0][0x14] is gridDim.x
c[0x0][0x18] is gridDim.y
c[0x0][0x1c] is gridDim.z

It looks like c[0x0][0x20] may be the initial value of the stack pointer. Kernel arguments start at c[0x0][0x140].
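
If you want to verify this yourself, a tiny probe kernel does it (a sketch; the kernel name is made up). Build it with nvcc -arch=sm_52 -cubin and disassemble with cuobjdump --dump-sass; the loads below should show the offsets listed above, and the pointer argument should be read from c[0x0][0x140]:

// Hypothetical probe kernel for inspecting the sm_52 constant bank layout.
__global__ void probe(unsigned int *out)
{
    out[0] = blockDim.x;   // expect a read of c[0x0][0x8]
    out[1] = gridDim.x;    // expect a read of c[0x0][0x14]
}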

Thanks for the useful info. I’m looking at a “password cracking” tool; here is the beginning of the kernel:

MOV R1, c[0x0][0x20] // LOAD ARG
MOV R2, c[0x0][0x190] // LOAD ARG
MOV R3, c[0x0][0x194] // LOAD ARG

LDG.E R2, [R2] // LOAD from ptr R2
ISETP.NE.AND P0, PT, R2, RZ, PT // if R2 != 0:
@P0 EXIT // then EXIT

MOV R30, c[0x0][0x174] // LOAD ARG
MOV R40, c[0x0][0x170] // LOAD ARG
MOV32I R6, 0x1 // R6 = 1
MOV R7, c[0x0][0x164] // LOAD ARG
SHF.L.W R2, R40, 0x1, R30 // ROTATE LEFT 1
LOP.XOR R6, R6, c[0x0][0x160] // XOR
SHF.L.W R0, R30, 0x1, R40 // ROTATE LEFT 1
LOP3.LUT R5, R7, RZ, R2, 0x96 // XOR 3
LOP3.LUT R12, R6, RZ, R0, 0x96 // XOR 3
SHF.L.W R11, R12, 0xc, R5 // ROTATE LEFT 12
SHF.L.W R12, R5, 0xc, R12 // ROTATE LEFT 12
MOV R17, c[0x0][0x17c] // FROM ARG
LOP3.LUT R18, R6, c[0x0][0x168], R0, 0x96 // XOR 3
LOP3.LUT R3, R7, c[0x0][0x16c], R2, 0x96 // XOR 3
LOP.XOR R0, R0, R6 // XOR1
LOP.XOR R61, R2, c[0x0][0x164] // XOR1

XMAD R10, R9, c[0x0][0x8], RZ // R10 = blockDim.x * R9
XMAD.MRG R15, R9, c[0x0][0x8].H1, RZ // R15 = (R9 * 64 + 0) + 64 << 16
MOV R8, c[0x0][0x178] // LOAD ARG
…(other computation)
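
For what it’s worth, here is my reading of the SHF.L.W pairs and the LOP3.LUT 0x96 lines above as CUDA device code (a sketch, not verified against the original source; valid for shift amounts 1–31):

#include <cstdint>

// Two SHF.L.W instructions with the register pair swapped implement a
// 64-bit rotate-left via funnel shifts, and LOP3.LUT with immLut 0x96
// is a three-input XOR (0x96 = 0xF0 ^ 0xCC ^ 0xAA).
__device__ uint64_t rotl64(uint32_t lo, uint32_t hi, unsigned s)
{
    uint32_t new_hi = __funnelshift_l(lo, hi, s);  // SHF.L.W Rhi', Rlo, s, Rhi
    uint32_t new_lo = __funnelshift_l(hi, lo, s);  // SHF.L.W Rlo', Rhi, s, Rlo
    return ((uint64_t)new_hi << 32) | new_lo;
}

__device__ uint32_t xor3(uint32_t a, uint32_t b, uint32_t c)
{
    return a ^ b ^ c;  // one LOP3.LUT Rd, Ra, Rb, Rc, 0x96
}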

(3) I got lost in the last 3 lines; I’m not sure where R9 is coming from.
(4) I didn’t dump the SASS code with cuobjdump since the tool uses a JIT compiler (apparently the tool’s owner also wanted to protect it; even nvprof couldn’t run the app). I had to use cuda-gdb to get this SASS code. Is that a good way to extract SASS, or can it miss some lines of code?

It has been a good long while since I last looked at the details of the various XMAD flavors. What I would do here is look at various integer multiplies (e.g. a 64-bit integer multiply) and reverse engineer the exact semantics of the multiple XMAD variants used in those, which are variations on 16x16+32->32 bit multiply-add. .MRG is “merge”, but I don’t recall how that works. There is also .PSL = “product shift left” or some such.
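
As a starting point, something like this throwaway kernel (name made up) should reproduce the pattern when compiled for sm_52 and dumped with cuobjdump --dump-sass. If I recall correctly, the 32-bit multiply comes out as XMAD / XMAD.MRG / XMAD.PSL.CBCC, but verify against your own dump:

// Hypothetical probe kernel for studying the XMAD flavors: the 32-bit
// multiply is built out of 16x16+32-bit multiply-add building blocks.
__global__ void mul_probe(unsigned int *out, unsigned int a, unsigned int b)
{
    *out = a * b;  // on sm_5x, expect a short XMAD sequence here
}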

Yes, I’m aware of that topic: XMAD meaning - #9 by SPWorley

I may be misunderstanding what you’re saying here, but is this any use?

The tool that I’m trying to reverse engineer doesn’t create any cache in the .nv directory (checked with strace). It also forks a lot of “watchdog” threads to break the program when there’s any potential reverse-engineering behavior. Apparently, all the SASS code (both precompiled and JIT-compiled kernels) gets pushed up to GPU memory (maybe it’s even in RAM), so with a little tweaking we can get it all.
Anyway, running the tool on sm_75 with the latest driver produces much nicer SASS code, so now I can understand it.

Thanks for all the help!

Side note: nvdisasm still cannot disassemble sm_80 and sm_86.

I don’t seem to have any trouble using nvdisasm to disassemble a cubin created with nvcc -arch=sm_86 -cubin ...

Tested with CUDA 11.1 and 11.2

Do you have CUDA 11.1 installed? I just tried cuobjdump --dump-sass on a file with sm_80 and sm_86 SASS and it disassembled it just fine. As best I know, cuobjdump just invokes nvdisasm under the hood when --dump-sass is specified.

Oh my, I have CUDA 11.0, not 11.1. Is that really an issue? This is what it shows:

nvdisasm fatal : Value 'SM86' is not defined for option 'binary'
0x7f79b07e7c00 <_Z26search64PV14…jPKj>: Cannot disassemble instruction

Unless I missed something, CUDA 11.x is the version that added Ampere support. So yes, you do need CUDA 11.x for sm_8x, and preferably the latest.

CUDA 11.0 added support for sm_80.

CUDA 11.1 added support for sm_86.
