I’m curious why cuobjdump -sass my.cubin shows a bunch of NOPs at the end of the code.
<snip>
/*00d0*/ FADD R9, R4, R3 ; /* 0x0000000304097221 */
/* 0x004fd00000000000 */
/*00e0*/ STG.E.SYS [R6], R9 ; /* 0x0000000906007386 */
/* 0x000fe2000010e900 */
/*00f0*/ EXIT ; /* 0x000000000000794d */
/* 0x000fea0003800000 */
/*0100*/ BRA 0x100; /* 0xfffffff000007947 */
/* 0x000fc0000383ffff */
/*0110*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0120*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0130*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0140*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0150*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0160*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0170*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
..........
This matches the cuobjdump example in the documentation: CUDA Binary Utilities :: CUDA Toolkit Documentation
However, disassembling the same file with nvdisasm my.cubin gives a listing that appears to end exactly at the final branch (BRA) instruction, with no trailing NOPs.
<snip>
/*00d0*/ FADD R9, R4, R3 ;
/*00e0*/ STG.E.SYS [R6], R9 ;
/*00f0*/ EXIT ;
.L_1:
/*0100*/ BRA `(.L_1);
.L_28:
For reference, I’m trying all of this with the simple “vectorAdd” sample included with the CUDA samples.
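
To compare the two listings more mechanically, here is a minimal Python sketch that counts how many NOP instructions end a saved disassembly. It assumes the listing text has the same shape as the excerpts above: instruction lines start with a hex address in /* */ followed by the opcode, while encoding-only lines and labels do not. The name count_trailing_nops is made up for illustration and is not part of any CUDA tool.

```python
import re

# Matches an instruction line such as "/*0110*/ NOP; /* 0x... */" and
# captures the address and the opcode mnemonic. Encoding-only lines
# ("/* 0x000fc00000000000 */") and labels (".L_1:") do not match.
INSTR = re.compile(r"^\s*/\*([0-9a-fA-F]{4,})\*/\s+([A-Za-z_][\w.]*)")


def count_trailing_nops(listing: str) -> int:
    """Return how many consecutive NOP instructions end the listing.

    Hypothetical helper: the line format is assumed from the excerpts
    in this post, not taken from a documented output specification.
    """
    opcodes = [
        m.group(2)
        for line in listing.splitlines()
        if (m := INSTR.match(line))
    ]
    count = 0
    for op in reversed(opcodes):
        if op != "NOP":
            break
        count += 1
    return count
```

Running it on the output of each tool (saved to a text file and read in as a string) should make the difference in padding easy to see without eyeballing the tail of the dump.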