{ } in SASS output generated for cc5.0, cc6.0 and cc6.1

I compiled the same code with -arch=sm_61 and checked the SASS with cuobjdump (10.0). The output contains several pairs of curly braces. The following is the beginning of the SASS code, and it shows the braces at the end. Does anyone know what they mean? I do not see braces with -arch=sm_35, but I do see them with -arch=sm_50 and -arch=sm_60.
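For reference, the steps were along these lines (the file name is just a placeholder, not my actual file):

	nvcc -arch=sm_61 -cubin -o kclock.cubin kclock.cu
	cuobjdump -sass kclock.cubin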

Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_61
		Function : _Z12kclock_test2PjS_iijS_
	.headerflags    @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
                                                                                     /* 0x001fc400fe2007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                            /* 0x4c98078000870001 */
        /*0010*/                   ISETP.NE.AND P0, PT, RZ, c[0x0][0x158], PT ;      /* 0x4b6b03800567ff07 */
        /*0018*/                   MOV R4, c[0x0][0x154] ;                           /* 0x4c98078005570004 */
                                                                                     /* 0x001ff400fe0007fb */
        /*0028*/                   MOV R10, RZ ;                                     /* 0x5c9807800ff7000a */
        /*0030*/         {         IADD R4, R4, c[0x0][0x150] ;                      /* 0x4c10000005470404 */
        /*0038*/              @!P0 BRA 0x478         }

I don’t think this is documented anywhere, but from looking at a fair number of examples I believe it means the bracketed instructions will dual issue from the same warp. You could probably confirm this by looking at the third-party ‘maxas’ assembler, because this would have to be explicitly encoded in the control block for each three-instruction bundle (in your example, the 64-bit word at 0x20).

Scott Gray described the Maxwell control block in his ‘maxas’ write-up (https://devtalk.nvidia.com/default/topic/773064/maxwell-assembler/): each 64-bit control word governs the three instructions that follow it, with 21 bits per instruction holding a stall count, a yield hint, read and write dependency barriers, a wait-barrier mask, and register reuse flags.

Presumably intra-warp dual issue is encoded via the yield field, or a combination of the yield and stall fields. The reason you do not see bracketed instructions with Kepler is that Kepler does not have intra-warp dual-issue capability and encodes its scheduling information in a more rudimentary 64-bit control block covering a seven-instruction bundle. And you probably don’t see it with Volta/Turing (I have no hands-on experience as of now) because the dual-issue mechanism and instruction encoding have changed significantly yet again, with even more control-block information stored per instruction.
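If that reverse-engineered description is accurate (none of this is official, so take it with a grain of salt), a small host-side decoder could pick apart the control word at 0x20 in your dump like this:

	#include <stdio.h>
	#include <stdint.h>

	/* Split a Maxwell/Pascal control word into three 21-bit sections,
	   one per following instruction, using the field layout from the
	   'maxas' write-up (reverse engineered, not official). */
	static void decode_ctrl(uint64_t word)
	{
	    for (int i = 0; i < 3; i++) {
	        uint32_t c = (uint32_t)((word >> (21 * i)) & 0x1FFFFF);
	        printf("slot %d: stall=%2u yield=%u wrtdb=%u readb=%u watdb=0x%02x reuse=0x%x\n",
	               i,
	               c & 0xF,          /* stall count; 0 reportedly means dual issue with the next insn */
	               (c >> 4) & 0x1,   /* yield hint */
	               (c >> 5) & 0x7,   /* write dependency barrier (7 = none) */
	               (c >> 8) & 0x7,   /* read dependency barrier (7 = none) */
	               (c >> 11) & 0x3F, /* wait-on-barrier mask */
	               (c >> 17) & 0xF); /* register reuse cache flags */
	    }
	}

	int main(void)
	{
	    decode_ctrl(0x001ff400fe0007fbULL); /* the control word at 0x20 above */
	    return 0;
	}

Under that layout, the middle section (covering the IADD at 0x30) decodes with a stall count of 0, which would be exactly the dual-issue marker for the braced pair.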

Why does Kepler not have dual-issue capability? I found the following post discussing dual issue on Kepler.
https://devtalk.nvidia.com/default/topic/1057703/cuda-programming-and-performance/dual-issue-and-other-timing-behavior-of-the-kepler-warp-scheduler-/
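In case it helps the discussion, this is the general shape of the clock()-based probe I had in mind (the kernel name and loop counts are made up for illustration, not taken from that thread): time a stretch of independent instructions and see whether a single warp can sustain more than one instruction per cycle.

	#include <cstdio>

	/* Illustrative probe (e.g. nvcc -arch=sm_35 probe.cu): time two
	   independent dependency chains with clock() and compare cycles
	   against the number of instructions issued. An issue rate above
	   one instruction per cycle for a single warp would indicate
	   dual issue. */
	__global__ void issue_probe(unsigned int *cycles, int *sink)
	{
	    int a = threadIdx.x + 1, b = blockIdx.x + 2;
	    int c = a ^ 0x55, d = b ^ 0x33;
	    unsigned int start = (unsigned int)clock();
	#pragma unroll
	    for (int i = 0; i < 256; i++) {  /* 4 ALU ops per iteration, 2 chains */
	        a += b; c ^= d;              /* the two chains are independent */
	        b += a; d ^= c;
	    }
	    unsigned int stop = (unsigned int)clock();
	    if (threadIdx.x == 0) {
	        cycles[0] = stop - start;
	        sink[0] = a + b + c + d;     /* keep the arithmetic live */
	    }
	}

	int main()
	{
	    unsigned int *cycles; int *sink;
	    cudaMallocManaged(&cycles, sizeof(*cycles));
	    cudaMallocManaged(&sink, sizeof(*sink));
	    issue_probe<<<1, 32>>>(cycles, sink);  /* a single warp */
	    cudaDeviceSynchronize();
	    printf("%u cycles for %d issued ALU instructions\n", cycles[0], 256 * 4);
	    return 0;
	}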

Sorry, can’t help you. I have already speculated as far as I reasonably could based on very limited knowledge. If you want to take this further, I suggest you contact the people who have taken the time to reverse engineer the architecture-specific control words used by GPUs, read their write-ups where available, or reverse engineer this yourself.

I don’t have a deeper interest in this kind of reverse engineering, and I have never attempted, nor plan to attempt, to program GPUs at the SASS level. With a new architecture generation out every two years or so, and NVIDIA giving zero support to such endeavors, I consider that an exercise in futility.

Kepler has dual-issue capability (in fact, Fermi SM 2.1 had some limited dual-issue capability). If it did not, there would be no way to supply (and therefore no reason to have) 192 cores in a single SM with only 4 warp schedulers per SM: four schedulers each issuing a single instruction per warp per cycle could keep at most 4 x 32 = 128 lanes busy, which falls short of 192.

But the exact mechanism of dual issue has changed with the changing SM design across GPU architectures. That changing SM design (I surmise) has necessitated changes to the structure of SASS (e.g. binary encoding, if nothing else) from one architecture to the next.

I’m also probably not much help beyond that, for a variety of reasons including those mentioned by njuffa plus:

  1. I don’t know how to reverse engineer things.
  2. I might get fired if I release material non-public information in a public forum.

You can always request documentation changes using the bug reporting method linked in a sticky post at the top of this forum.

What seems reasonably clear is that NVIDIA’s successive GPU architectures push more and more control down into the instruction bundles, presumably to keep hardware complexity low while allowing an increasing amount of increasingly general dual-issue capability. I think the argument could be made that these are actually VLIW designs, except that for marketing reasons nobody wants to call them that.

I may confuse / conflate Fermi and Kepler here, but as I recall, whatever limited dual-issue capability existed in these older architectures was hardly directly exploitable; it was even hard to observe it occurring in practice. These were designs with a sub-optimal balance between execution resources and “enabling” resources, such as scheduling resources. As GPU users, we are benefiting from NVIDIA’s learning curve here.

Because I was involved in the design of non-Intel x86 processors, where binary compatibility is key, I do have experience with reverse engineering the microarchitecture of processors. But this is slow, tedious work, and in a rapidly changing environment like the GPU world I can envision only a few use cases where investing that time (= money) makes sense.