{ } in SASS output generated for cc5.0, cc6.0 and cc6.1

I compiled the same code with -arch=sm_61 and checked the SASS with cuobjdump (10.0). The output contains several pairs of curly braces. The following is the beginning of the SASS code, and it shows the braces at the end. Does anyone know what they mean? I do not see braces with -arch=sm_35, but I do see them with -arch=sm_50 and -arch=sm_60.
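For reference, the steps were along these lines (the file name is just a placeholder, not my actual file):

	nvcc -arch=sm_61 -cubin -o kclock.cubin kclock.cu
	cuobjdump -sass kclock.cubin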

Fatbin elf code:
================
arch = sm_61
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

	code for sm_61
		Function : _Z12kclock_test2PjS_iijS_
	.headerflags    @"EF_CUDA_SM61 EF_CUDA_PTX_SM(EF_CUDA_SM61)"
                                                                                     /* 0x001fc400fe2007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                            /* 0x4c98078000870001 */
        /*0010*/                   ISETP.NE.AND P0, PT, RZ, c[0x0][0x158], PT ;      /* 0x4b6b03800567ff07 */
        /*0018*/                   MOV R4, c[0x0][0x154] ;                           /* 0x4c98078005570004 */
                                                                                     /* 0x001ff400fe0007fb */
        /*0028*/                   MOV R10, RZ ;                                     /* 0x5c9807800ff7000a */
        /*0030*/         {         IADD R4, R4, c[0x0][0x150] ;                      /* 0x4c10000005470404 */
        /*0038*/              @!P0 BRA 0x478         }

I don’t think this is documented anywhere, but from looking at a fair number of examples I believe it means the bracketed instructions will dual issue from the same warp. You could probably confirm this by looking at the third-party ‘maxas’ assembler, because this would have to be explicitly encoded in the control block for each three-instruction bundle (in your example, the 64-bit word at 0x20).

Scott Gray described the Maxwell control block in his ‘maxas’ write-up (https://devtalk.nvidia.com/default/topic/773064/maxwell-assembler/): each 64-bit control word governs the three instructions that follow it, with 21 bits per instruction holding a stall count, a yield hint, read and write dependency barriers, a wait-barrier mask, and register reuse flags.

Presumably intra-warp dual issue is encoded via the yield field, or a combination of the yield and stall fields. The reason you do not see bracketed instructions with Kepler is that Kepler does not have intra-warp dual-issue capability and encodes its scheduling information in a more rudimentary 64-bit control block covering a seven-instruction bundle. And you probably don’t see it with Volta/Turing (I have no hands-on experience as of now) because the dual-issue mechanism and instruction encoding have changed significantly yet again, with even more control-block information stored per instruction.
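If that reverse-engineered description is accurate (none of this is official, so take it with a grain of salt), a small host-side decoder could pick apart the control word at 0x20 in your dump like this:

	#include <stdio.h>
	#include <stdint.h>

	/* Split a Maxwell/Pascal control word into three 21-bit sections,
	   one per following instruction, using the field layout from the
	   'maxas' write-up (reverse engineered, not official). */
	static void decode_ctrl(uint64_t word)
	{
	    for (int i = 0; i < 3; i++) {
	        uint32_t c = (uint32_t)((word >> (21 * i)) & 0x1FFFFF);
	        printf("slot %d: stall=%2u yield=%u wrtdb=%u readb=%u watdb=0x%02x reuse=0x%x\n",
	               i,
	               c & 0xF,          /* stall count; 0 reportedly means dual issue with the next insn */
	               (c >> 4) & 0x1,   /* yield hint */
	               (c >> 5) & 0x7,   /* write dependency barrier (7 = none) */
	               (c >> 8) & 0x7,   /* read dependency barrier (7 = none) */
	               (c >> 11) & 0x3F, /* wait-on-barrier mask */
	               (c >> 17) & 0xF); /* register reuse cache flags */
	    }
	}

	int main(void)
	{
	    decode_ctrl(0x001ff400fe0007fbULL); /* the control word at 0x20 above */
	    return 0;
	}

Under that layout, the middle section (covering the IADD at 0x30) decodes with a stall count of 0, which would be exactly the dual-issue marker for the braced pair.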

Why does Kepler not have dual-issue capability? I found the following post discussing dual issue on Kepler.
https://devtalk.nvidia.com/default/topic/1057703/cuda-programming-and-performance/dual-issue-and-other-timing-behavior-of-the-kepler-warp-scheduler-/
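In case it helps the discussion, this is the general shape of the clock()-based probe I had in mind (the kernel name and loop counts are made up for illustration, not taken from that thread): time a stretch of independent instructions and see whether a single warp can sustain more than one instruction per cycle.

	#include <cstdio>

	/* Illustrative probe (e.g. nvcc -arch=sm_35 probe.cu): time two
	   independent dependency chains with clock() and compare cycles
	   against the number of instructions issued. An issue rate above
	   one instruction per cycle for a single warp would indicate
	   dual issue. */
	__global__ void issue_probe(unsigned int *cycles, int *sink)
	{
	    int a = threadIdx.x + 1, b = blockIdx.x + 2;
	    int c = a ^ 0x55, d = b ^ 0x33;
	    unsigned int start = (unsigned int)clock();
	#pragma unroll
	    for (int i = 0; i < 256; i++) {  /* 4 ALU ops per iteration, 2 chains */
	        a += b; c ^= d;              /* the two chains are independent */
	        b += a; d ^= c;
	    }
	    unsigned int stop = (unsigned int)clock();
	    if (threadIdx.x == 0) {
	        cycles[0] = stop - start;
	        sink[0] = a + b + c + d;     /* keep the arithmetic live */
	    }
	}

	int main()
	{
	    unsigned int *cycles; int *sink;
	    cudaMallocManaged(&cycles, sizeof(*cycles));
	    cudaMallocManaged(&sink, sizeof(*sink));
	    issue_probe<<<1, 32>>>(cycles, sink);  /* a single warp */
	    cudaDeviceSynchronize();
	    printf("%u cycles for %d issued ALU instructions\n", cycles[0], 256 * 4);
	    return 0;
	}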

Sorry, can’t help you. I have already speculated as far as I reasonably could based on very limited knowledge. If you want to take this further, I suggest you contact the people who have taken the time to reverse engineer the architecture-specific control words used by GPUs, read their write-ups where available, or reverse engineer this yourself.

I don’t have a deeper interest in this kind of reverse engineering, and I have never attempted, nor plan to attempt, to program GPUs at the SASS level. With a new architecture generation out every two years or so, and NVIDIA giving zero support to such endeavors, I consider that an exercise in futility.

Kepler has dual-issue capability (in fact, Fermi SM 2.1 had some limited dual-issue capability). If it did not, there would be no way to supply (and therefore no reason to have) 192 cores in a single SM with only 4 warp schedulers per SM: four schedulers each issuing a single instruction per warp per cycle could keep at most 4 x 32 = 128 lanes busy, which falls short of 192.

But the exact mechanism of dual issue has changed with the changing SM design across GPU architectures. That changing SM design (I surmise) has necessitated changes to the structure of SASS (e.g. binary encoding, if nothing else) from one architecture to the next.

I’m also probably not much help beyond that, for a variety of reasons including those mentioned by njuffa plus:

  1. I don’t know how to reverse engineer things.
  2. I might get fired if I release material non-public information in a public forum.

You can always request documentation changes using the bug reporting method linked in a sticky post at the top of this forum.

What seems reasonably clear is that NVIDIA’s successive GPU architectures push more and more control down into the instruction bundles, presumably to keep hardware complexity low while allowing an increasing amount of increasingly general dual-issue capability. I think the argument could be made that these are actually VLIW designs, except that for marketing reasons nobody wants to call them that.

I may confuse / conflate Fermi and Kepler here, but as I recall, whatever limited dual-issue capability existed in these older architectures was hardly directly exploitable; it was even hard to observe it occurring in practice. These were designs with a sub-optimal balance between execution resources and “enabling” resources, such as scheduling resources. As GPU users, we are benefiting from NVIDIA’s learning curve here.

Because I was involved in the design of non-Intel x86 processors, where binary compatibility is key, I do have experience with reverse engineering the microarchitecture of processors. But this is slow, tedious work, and in a rapidly changing environment like the GPU world I can envision only a few use cases where investing that time (= money) makes sense.