BitField Generation makes no sense

The basic problem is that the compiler generates insane ISA for bitfields, while using bit-field extract manually results in MUCH more compact (and faster) code.

Take the following two simple kernels, one that uses bitfields and the other that uses bit-field extract:

struct bitField {
  int a : 3;   // bits  0..2
  int b : 7;   // bits  3..9
  int c : 22;  // bits 10..31 (3 + 7 + 22 = 32 bits total)
};

__global__
void separateBitFields(bitField *in, int *out1, int *out2, int *out3)
{
  const int tid = threadIdx.x;

  bitField foo = in[tid];
  out1[tid] = foo.a;
  out2[tid] = foo.b;
  out3[tid] = foo.c;
}

// Extract numBits bits of x starting at bit position 'bit'
// (zero-extended) via the PTX bfe.u32 instruction.
__device__ __forceinline__
unsigned int bfe(unsigned int x, unsigned int bit, unsigned int numBits) {
    unsigned int ret;
    asm("bfe.u32 %0, %1, %2, %3;" :
            "=r"(ret) : "r"(x), "r"(bit), "r"(numBits));
    return ret;
}

__global__
void separateBitFields(unsigned int *in, unsigned int *out1, unsigned int *out2, unsigned int *out3)
{
  const int tid = threadIdx.x;

  unsigned int foo = in[tid];

  out1[tid] = bfe(foo, 0, 3); 
  out2[tid] = bfe(foo, 3, 7); 
  out3[tid] = bfe(foo, 10, 22);
}
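
Side note: since the struct declares its fields as signed int, the bitfield version sign-extends each field while bfe.u32 zero-extends, so the two kernels are not strictly equivalent. A signed wrapper would use bfe.s32, which sign-extends the extracted field; here is a sketch following the same pattern:

// Sketch: signed counterpart of bfe() above, using bfe.s32,
// which sign-extends the extracted field to 32 bits.
__device__ __forceinline__
int bfeSigned(int x, unsigned int bit, unsigned int numBits) {
    int ret;
    asm("bfe.s32 %0, %1, %2, %3;" :
            "=r"(ret) : "r"(x), "r"(bit), "r"(numBits));
    return ret;
}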

And then look at the output of the generated ISA for the two:

code for sm_20
    Function : _Z17separateBitFieldsPjS_S_S_
  /*0000*/     /*0x00005de428004404*/   MOV R1, c [0x1] [0x100];
  /*0008*/     /*0x84019c042c000000*/   S2R R6, SR_Tid_X;
  /*0010*/     /*0x10029de218000000*/   MOV32I R10, 0x4;
  /*0018*/     /*0x1061dce35000c000*/   IMUL.HI R7, R6, 0x4;
  /*0020*/     /*0x80621ca320158000*/   IMAD R8.CC, R6, R10, c [0x0] [0x20];
  /*0028*/     /*0x90725c4348004000*/   IADD.X R9, R7, c [0x0] [0x24];
  /*0030*/     /*0xa0611ca320158000*/   IMAD R4.CC, R6, R10, c [0x0] [0x28];
  /*0038*/     /*0x00801c8584000000*/   LD.E R0, [R8];
  /*0040*/     /*0xb0715c4348004000*/   IADD.X R5, R7, c [0x0] [0x2c];
  /*0048*/     /*0xc0609ca320158000*/   IMAD R2.CC, R6, R10, c [0x0] [0x30];
  /*0050*/     /*0xd070dc4348004000*/   IADD.X R3, R7, c [0x0] [0x34];
  /*0058*/     /*0xe0621ca320158000*/   IMAD R8.CC, R6, R10, c [0x0] [0x38];
  /*0060*/     /*0xf0725c4348004000*/   IADD.X R9, R7, c [0x0] [0x3c];
  /*0068*/     /*0x00019c037000c00c*/   BFE.U32 R6, R0, 0x300;
  /*0070*/     /*0x0c01dc037000c01c*/   BFE.U32 R7, R0, 0x703;
  /*0078*/     /*0x28001c037000c058*/   BFE.U32 R0, R0, 0x160a;
  /*0080*/     /*0x00419c8594000000*/   ST.E [R4], R6; 
  /*0088*/     /*0x0021dc8594000000*/   ST.E [R2], R7; 
  /*0090*/     /*0x00801c8594000000*/   ST.E [R8], R0; 
  /*0098*/     /*0x00001de780000000*/   EXIT;
    ..............................................


    Function : _Z17separateBitFieldsP8bitFieldPiS1_S1_
  /*0000*/     /*0x00005de428004404*/   MOV R1, c [0x1] [0x100];
  /*0008*/     /*0x84021c042c000000*/   S2R R8, SR_Tid_X;
  /*0010*/     /*0x10031de218000000*/   MOV32I R12, 0x4;
  /*0018*/     /*0x1081dce35000c000*/   IMUL.HI R7, R8, 0x4;
  /*0020*/     /*0x80829ca320198000*/   IMAD R10.CC, R8, R12, c [0x0] [0x20];
  /*0028*/     /*0x9072dc4348004000*/   IADD.X R11, R7, c [0x0] [0x24];
  /*0030*/     /*0xa0809ca320198000*/   IMAD R2.CC, R8, R12, c [0x0] [0x28];
  /*0038*/     /*0x00a19c8584000000*/   LD.E R6, [R10];
  /*0040*/     /*0xb070dc4348004000*/   IADD.X R3, R7, c [0x0] [0x2c];
  /*0048*/     /*0xc0811ca320198000*/   IMAD R4.CC, R8, R12, c [0x0] [0x30];
  /*0050*/     /*0xd0715c4348004000*/   IADD.X R5, R7, c [0x0] [0x34];
  /*0058*/     /*0x18229c041d800000*/   I2I.U32.U8 R10, R6.B3;
  /*0060*/     /*0x18225c041d000000*/   I2I.U32.U8 R9, R6.B2;
  /*0068*/     /*0x18201c041c000000*/   I2I.U32.U8 R0, R6; 
  /*0070*/     /*0x18219c041c800000*/   I2I.U32.U8 R6, R6.B1;
  /*0078*/     /*0x2822dc041c000000*/   I2I.U32.U8 R11, R10;
  /*0080*/     /*0x24229c041c000000*/   I2I.U32.U8 R10, R9; 
  /*0088*/     /*0x18225c041c000000*/   I2I.U32.U8 R9, R6; 
  /*0090*/     /*0xe0819ca320198000*/   IMAD R6.CC, R8, R12, c [0x0] [0x38];
  /*0098*/     /*0x60b2dc036000c000*/   SHL R11, R11, 0x18;
  /*00a0*/     /*0x20921c036000c000*/   SHL R8, R9, 0x8;
  /*00a8*/     /*0x40a29c036000c000*/   SHL R10, R10, 0x10;
  /*00b0*/     /*0xf071dc4348004000*/   IADD.X R7, R7, c [0x0] [0x3c];
  /*00b8*/     /*0x00825c4368000000*/   LOP.OR R9, R8, R0; 
  /*00c0*/     /*0x2ca29c4368000000*/   LOP.OR R10, R10, R11;
  /*00c8*/     /*0x74001c036000c000*/   SHL R0, R0, 0x1d;
  /*00d0*/     /*0x58925c036000c000*/   SHL R9, R9, 0x16;
  /*00d8*/     /*0x20a29c4368000000*/   LOP.OR R10, R10, R8; 
  /*00e0*/     /*0x74001c235800c000*/   SHR R0, R0, 0x1d;
  /*00e8*/     /*0x64921c235800c000*/   SHR R8, R9, 0x19;
  /*00f0*/     /*0x28a25c235800c000*/   SHR R9, R10, 0xa;
  /*00f8*/     /*0x00201c8594000000*/   ST.E [R2], R0;
  /*0100*/     /*0x00421c8594000000*/   ST.E [R4], R8;
  /*0108*/     /*0x00625c8594000000*/   ST.E [R6], R9;
  /*0110*/     /*0x00001de780000000*/   EXIT;
    ........................................................

Is there any reason the compiler doesn’t just use bfe automatically when using bit-fields?

In the context of your application, how much performance difference do you observe between the code generated by the compiler vs the code manually optimized using BFE via PTX inline assembly? If your experiments show a significant performance advantage from the use of BFE, I would suggest filing an RFE (request for enhancement) via the bug reporting form linked from the registered developer website. Please state the GPU you are using when filing the report, as tradeoffs between instructions often differ based on architecture.

Please note that using more instructions does not automatically mean the code is slower. Different instructions have different throughput rates. Based on my limited knowledge it seems that the use of BFE would indeed be advantageous here, but I have not quantified the benefit.

The performance difference is significant. I was planning on filing an RFE, but was curious whether there was a reason the compiler has to do things the way it does now. Thanks.

It’s likely the BFE PTX opcode is designed for runtime evaluation (with dynamic bit positions and widths). But the “manual” method you compare to lets the nvcc compiler determine the subfields at compile time, so it’s much more efficient. nvcc can’t really optimize BFE since it’s an ASM macro and therefore a black box.

You could argue that BFE could be optimized in ptxas to detect compile-time constant arguments in the PTX and switch to more efficient code substitution, but BFE is a pretty rare and minor opcode, so it’s no surprise the compiler guys have focused on other things.

@SPWorley - I think there may be a misunderstanding.

The code that uses bitfields generates very poor ISA that doesn’t actually use the instruction called “bit-field extract”.

The code that manually uses bfe through inline asm generates perfectly fine ISA.

Both the bitfield itself and the arguments to bfe are compile time constants, so the information available to the compiler should be the same.

You can tell from the resulting ISA that ptxas has already used the fact that the bit positions are compile-time constants to optimize the BFE call to:

BFE.U32 R6, R0, 0x300;

The third argument 0x300 is there because I want to start at bit 0 and extract 3 bits. If the position and number of bits to extract were determined at runtime, the third argument would be a register instead of an immediate.
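
Decoding all three immediates in the dump is consistent with that encoding (width in the upper byte, start position in the lower byte):

// BFE immediate = (numBits << 8) | startBit
// 0x0300 ->  3 bits at bit  0   (foo.a)
// 0x0703 ->  7 bits at bit  3   (foo.b)
// 0x160a -> 22 bits at bit 10   (foo.c)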

I have certainly seen BFE instructions in disassembled code, so it is not like the compiler is completely unaware of the existence of the instruction. In my experience, not many codes use bit fields, so you may simply be the first one to notice this missed optimization opportunity. If you could attach code to your RFE that demonstrates the performance benefit, that would be helpful.

Semi-related tip on BFE.64 and word-straddling bit-fields…

I was just working on a 64-bit struct of bit fields and unfortunately I had a field that straddled the 32-bit boundary between the two words. I overlooked the fact that this can be hazardous. :)

This resulted in the struct being much larger as NVCC chooses to match MSVC’s behavior and bump the boundary-straddling field to the next word.

It’s an implementation-defined choice, but Microsoft and NVCC seem to agree on how to pack bits: bit fields that would straddle a word boundary are moved to a new word.

One strategy is to simply split any overlapping bit field into two parts (“fieldLo” and “fieldHi”) on either side of the 32-bit boundary and recombine them explicitly later, since 64-bit BFEs on a boundary appear to be mapped to SHR+ISCADD+AND. Translation: a BFE.64 op on a boundary does not map to a single SASS op.
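
For illustration, here is a minimal sketch of that split; the struct layout and field names are hypothetical:

// Hypothetical 64-bit packed struct where a 30-bit field 'y' would
// straddle the 32-bit boundary; split it into yLo/yHi instead.
struct packed64 {
  unsigned long long x   : 20;  // bits  0..19
  unsigned long long yLo : 12;  // bits 20..31 (low half of 'y')
  unsigned long long yHi : 18;  // bits 32..49 (high half of 'y')
  unsigned long long z   : 14;  // bits 50..63
};

__device__ __forceinline__
unsigned int getY(const packed64 &p) {
  // Recombine the two halves explicitly.
  return (unsigned int)p.yLo | ((unsigned int)p.yHi << 12);
}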

However, on sm_35 a boundary bit field extract is a more elegant SHF.R+AND, so you might want to special-case that bit field fetch by executing it directly against a 64-bit value unioned with the bit field struct.
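
A sketch of that union trick, reusing the hypothetical struct above to pull the straddling 30-bit field out of the raw 64-bit value in one shot:

// Sketch: union the bit field struct with a raw 64-bit value and
// extract the straddling field directly with shift+mask
// (which sm_35 can turn into SHF.R+AND).
union packed64Bits {
  packed64 s;
  unsigned long long raw;
};

__device__ __forceinline__
unsigned int getYDirect(const packed64Bits &u) {
  return (unsigned int)((u.raw >> 20) & 0x3fffffffu);  // bits 20..49
}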

So be careful with bit fields. :)

@eelsen, I had to go back today and double-check the status of BFE codegen.

Good news: with CUDA 6.5 there is a BFE being generated across sm_20 through sm_52 architectures.

But bit field assignments still don’t seem to generate BFI ops.
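
For reference, a sketch of the analogous inline-PTX wrapper for bit-field insert; per the PTX ISA, bfi.b32 inserts the low numBits bits of x into y starting at position bit:

// Sketch: bit-field insert via inline PTX (not generated automatically).
__device__ __forceinline__
unsigned int bfi(unsigned int x, unsigned int y,
                 unsigned int bit, unsigned int numBits) {
    unsigned int ret;
    asm("bfi.b32 %0, %1, %2, %3, %4;" :
            "=r"(ret) : "r"(x), "r"(y), "r"(bit), "r"(numBits));
    return ret;
}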

Do you observe performance improvements when you code the bit-field-insert operations using inline PTX? I have not looked at bit field operations in quite a while but seem to recall that these are not exactly high-throughput instructions, so an alternate sequence of simple instructions may give the same or better performance. Depending on your results, another RFE may be in order :-)

I’ve never properly benchmarked the ops but they’re much prettier to look at than shift+and combos. :)

On line “0040” in the implicit case you can see that the compiler cleverly chooses to use a signed shift right by 10 to extract the signed top 22 bits of the structure instead of using a BFE. Perhaps that might be an example of where a single SHR is faster than a BFE?
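
In C terms that single-shift extraction is simply the following, assuming word holds the raw 32-bit struct value and that signed right shifts are arithmetic (which holds on NVIDIA GPUs):

// The top field occupies bits 10..31, so one arithmetic right shift
// both extracts and sign-extends it ('word' is the raw 32-bit value):
int c = (int)word >> 10;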

I do not require my machine code to be pretty, just correct and fast. I tend to favor RISC-style approaches to ISAs, as complex operations (like BFE and BFI) can be difficult to generate from an HLL, and it does not usually help if the same operation can be accomplished in N different ways: Do you want that left shift coded as a SHL, a SHF, a PERM, an ISCADD, an IMUL, or an IMAD instruction, with each of those going through various different execution units with different throughputs? I probably forgot a few possibilities; couldn’t BFI be used as a left shift, too?

These days the CUDA compiler is aware of an amazing number of clever transformations. I sometimes look at a piece of SASS and cannot immediately figure out how it relates to my source, as there are operations in the SASS that do not exist in my code, such as clever and unexpected uses of IMAD or BFE.

Based on my limited analysis, the majority of the new cleverness seems to come from NVVM (an LLVM derivative). However, NVVM sometimes is too clever by half: In at least one instance I found the transformation applied was a significant pessimization for NVIDIA GPUs. I do not remember the details (it may have been introducing tons of shifts where none were needed), but it got fixed in time for CUDA 6.5.

So at least for ninjas it still pays to be vigilant, and when in doubt, run some experiments if the code generation looks sub-optimal or plain weird.