Not directly related to Maxwell, but I’m pleased to see improved code generation in CUDA 6.0. After recompiling my image processing code, the instruction count dropped by 12% and kernel time by 22%!
One thing that has always bothered me is the very inefficient array indexing code. Unlike x86, which can compute
index * scale + offset + constOffset with a single load/store instruction, CUDA uses separate multiply and add instructions to do it (you can convert the array index into an induction variable, but that increases register use). 64-bit addressing makes it worse by doubling the number of instructions.
It took me a while to realize why my simple code had two multiplies for each memory load:
addressLow32 = IMAD(index, scale, pointerBaseLow32) // compute lower 32 bits of address
addressHigh32 = IMAD.hi(index, scale, pointerBaseHigh32) // compute upper 32 bits
With CUDA 6, the address generation is improved to:
addressLow32 = IMAD(index, scale, pointerBaseLow32)
sign = index < 0 ? 0xffffffff : 0 // sign-extend the index into the upper word
addressHigh32 = IADD.X(sign, pointerBaseHigh32) // add with the carry-in from the IMAD
which could be better for throughput, since IADD.X is cheaper than a second multiply, but it makes inspecting the assembly even harder by littering it with extra address-calculation instructions.