How to generate SASS output for a compiled DLL (mex)

For a typical application I can use:

cuobjdump a.out -ptx -sass

to generate the corresponding SASS or PTX, but what about for a CUDA/C++ mex file compiled for use with MATLAB?

I could copy the C++/CUDA source, edit it into a simple standalone executable, and generate SASS from that, but if there is a way to avoid that extra work I would appreciate a pointer in that direction.

Even though I set the -v flag for verbose PTXAS output when compiling the mex, the output folder contains fewer output files than for the equivalent executable.

Sorry if there is an obvious answer to this question; I have not needed to look at the assembly for a CUDA mex until now.

I have no experience with mex files, and don’t know whether they embed PTX or SASS. If the mex file is really just a binary DLL with embedded SASS, you should be able to run cuobjdump --dump-sass on it to extract the SASS code. Presumably cuobjdump looks for the ELF tag of the CUDA code and starts disassembling from there. Worth a try, if you haven’t tried yet.

That did work, thanks.

What does MUFU mean? I can see it defined as a ‘multi-function operator’, but I am not sure what that means in the context of my kernel.

Example:

/*2838*/                   MUFU.RCP R13, R21;

 /*2848*/                   FMUL.FTZ R10, R10, R10;
 /*2850*/                   FMUL.FTZ R10, R10, R13;
 /*2858*/                   SYNC;

 /*2868*/                   MUFU.RCP R28, R8;
 /*2870*/                   FADD.FTZ R19, R11, -R29.reuse;
 /*2878*/                   SSY 0x2a10;

 /*2888*/                   IADD R17, R23, 0x2;
 /*2890*/                   FADD.FTZ R30, R15, -R29;
 /*2898*/                   IADD R13, R20, 0x2;

 /*28a8*/                   SHR R20, R17.reuse, 0x1e;
 /*28b0*/                   FMUL.FTZ R29, R19, R28.reuse;
 /*28b8*/                   SHL.W R19, R17, 0x2;

 /*28c8*/                   FMUL.FTZ R17, R30, R28;
 /*28d0*/                   SHR R14, R13.reuse, 0x1e;
 /*28d8*/                   SHL.W R13, R13, 0x2;

 /*28e8*/                   FMNMX.FTZ R28, R29, 1, PT;
 /*28f0*/                   FMNMX.FTZ R29, R17, 1, PT;
 /*28f8*/                   FADD.FTZ R17, R12, -R10;

 /*2908*/                   IADD R10.CC, R13, c[0x0][0x148];
 /*2910*/                   IADD R23, R27, 0x2;
 /*2918*/                   FMNMX.FTZ R13, R28, RZ, !PT;

 /*2928*/                   FMNMX.FTZ R28, R29, RZ, !PT;
 /*2930*/                   FADD.FTZ R29, R11, -R22;
 /*2938*/                   IADD.X R11, R14, c[0x0][0x14c];

 /*2948*/                   IADD R12.CC, R19, c[0x0][0x148];
 /*2950*/                   SHL.W R27, R23, 0x2;
 /*2958*/                   FADD.FTZ R19, R13, -R28;

 /*2968*/                   FMNMX.FTZ R28, R29, R0, PT;
 /*2970*/                   FADD.FTZ R15, R15, -R22;
 /*2978*/                   IADD.X R13, R20, c[0x0][0x14c];

 /*2988*/                   IADD R14.CC, R27, c[0x0][0x148];
 /*2990*/                   SHR R23, R23, 0x1e;
 /*2998*/                   FMUL.FTZ R20, R8, R19;

 /*29a8*/                   FSETP.GT.FTZ.AND P2, PT, R28, RZ, PT;
 /*29b0*/                   FMNMX.FTZ R19, R15, R0, PT;
 /*29b8*/                   MOV R0, RZ;

 /*29c8*/                   IADD.X R15, R23, c[0x0][0x14c];
 /*29d0*/                   FFMA.FTZ R17, R7, R17, R20;
 /*29d8*/              @!P2 SYNC;

 /*29e8*/                   MUFU.RCP R22, R21;
 /*29f0*/                   FMUL.FTZ R0, R28, R28;
 /*29f8*/                   FMUL.FTZ R0, R0, R22;

 /*2a08*/                   SYNC;
 /*2a10*/                   FSETP.GT.FTZ.AND P2, PT, R19, RZ, PT;
 /*2a18*/                   SSY 0x2a70;

MUFU stands for multi-function unit, formerly known as SFU, the special function unit. MUFU.RCP is the single-precision reciprocal approximation built into the hardware, computed by quadratic interpolation in a table using fixed-point arithmetic. See this paper for a description of the MUFU:

S.F. Oberman and M.Y. Siu, “A High-Performance Area-Efficient Multifunction Interpolator,” Proc. 17th IEEE Symp. Computer Arithmetic (Arith-17), IEEE Press, 2005, pp. 272-279.
http://arith.polito.it/final/paper-164.pdf
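To connect this to the listing above: a MUFU.RCP followed by FMUL.FTZ instructions is what one would typically expect from a single-precision division compiled with fast-math settings, where the compiler replaces a / b with a * (1 / b) using the hardware reciprocal approximation. The kernel below is only a guess at the kind of source that produces such a pattern; the kernel name, parameter names, and file name are made up for illustration and are not the original poster's code. The exact SASS depends on the target architecture and toolkit version.

```cuda
// rcp_example.cu -- hypothetical example, not the original poster's kernel.
//
// Build to a cubin and disassemble (substitute an architecture that your
// toolkit supports for sm_50):
//   nvcc -arch=sm_50 -use_fast_math -cubin -o rcp_example.cubin rcp_example.cu
//   cuobjdump -sass rcp_example.cubin

__global__ void scaled_square(const float *d, const float *s, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float r = 0.0f;
    if (d[i] > 0.0f) {
        // With -use_fast_math this division is typically lowered to an
        // approximate reciprocal (MUFU.RCP) followed by a multiply (FMUL.FTZ),
        // rather than a call to the slower, correctly rounded division path.
        r = (d[i] * d[i]) / s[i];
    }
    out[i] = r;
}
```

Compiling the same file without -use_fast_math and comparing the two disassemblies should show how much extra code the correctly rounded division needs around the MUFU.RCP starting approximation.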

MUFU is a multi-function unit. It used to be called SFU, which gets called out as a hardware entity on various block diagrams of CUDA GPU SMs.

MUFU.RCP takes the reciprocal, I believe. The MUFU can do various operations, such as reciprocal square root, sine, cosine, etc.

http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#fermi__fermi-instruction-set

(edited based on comments from njuffa below)

Note that reciprocal and reciprocal square root are not transcendental functions. As far as I am aware, this is the full list of MUFU sub-opcodes (easily verified by disassembling the SASS generated for appropriate source code):

MUFU.RCP // reciprocal
MUFU.RSQ // reciprocal square root
MUFU.SIN // sine
MUFU.COS // cosine
MUFU.LG2 // logarithm base 2
MUFU.EX2 // exponentiation base 2
MUFU.RSQ64H // reciprocal square root on upper half of DP operand
MUFU.RCP64H // reciprocal on upper half of DP operand

A quick check using build targets sm_20 through sm_50 does not show any evidence of a MUFU.SQRT instruction.
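For anyone who wants to repeat that check, here is a minimal sketch of a probe kernel. The kernel name, file name, and output layout are assumptions made for illustration. Each line uses a standard CUDA math function or intrinsic that I would expect to map onto one of the sub-opcodes listed above, but which MUFU variants actually show up depends on the architecture and compiler version, so treat the comments as expectations to verify rather than guarantees.

```cuda
// mufu_probe.cu -- hypothetical probe kernel for spotting MUFU sub-opcodes.
//
// Build to a cubin and disassemble (substitute an architecture that your
// toolkit supports for sm_50), then search the output for "MUFU":
//   nvcc -arch=sm_50 -cubin -o mufu_probe.cubin mufu_probe.cu
//   cuobjdump -sass mufu_probe.cubin

__global__ void mufu_probe(const float *x, const double *xd,
                           float *out, double *outd)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float  v  = x[i];
    double vd = xd[i];

    // Single-precision operations; results are stored so that the compiler
    // cannot eliminate them as dead code.
    out[6 * i + 0] = __fdividef(1.0f, v); // expected: MUFU.RCP
    out[6 * i + 1] = rsqrtf(v);           // expected: MUFU.RSQ
    out[6 * i + 2] = __sinf(v);           // expected: MUFU.SIN
    out[6 * i + 3] = __cosf(v);           // expected: MUFU.COS
    out[6 * i + 4] = __log2f(v);          // expected: MUFU.LG2
    out[6 * i + 5] = exp2f(v);            // expected: MUFU.EX2

    // Double-precision operations; the MUFU result is only a starting
    // approximation that is refined by further instructions.
    outd[2 * i + 0] = 1.0 / vd;           // expected: MUFU.RCP64H
    outd[2 * i + 1] = rsqrt(vd);          // expected: MUFU.RSQ64H
}
```

Adding a line such as sqrtf(v) and rebuilding for different -arch values is a simple way to see how square root is implemented on a given target.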