Inefficient register use?

I’m trying to get my function to use <= 32 registers so 512 concurrent threads can run:

if (can_register_block)
float scratch_in[4], scratch_kernel[4];

for (uint x = 0; x < filter_w; ++x)
sum += …

The code NVCC generates uses 36 registers. When I comment out the else part, it drops to 29, which doesn’t make sense to me.

Clearly, every register used in the else part could reuse a register from the if part. The else part may extend the live ranges of some other values,
but I still don’t think it should need this many. Does anyone know how to fix this and save me from inspecting the PTX?

Two other possibilities might explain what you see:

  • The if and else blocks are short enough to be executed with predicated instructions. If that is true, then you can’t reuse the registers between the two branches.

  • Commenting out the else part caused the dead code optimizer to remove some other code which lowered the overall register usage.
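
To make the first possibility concrete, here is a minimal sketch of the two branch shapes in question. The function names and bodies are hypothetical and are not taken from the original code; whether the compiler actually predicates the short branch depends on the toolkit version and target, so treat this as an illustration only.

```cuda
// Hypothetical helpers for illustration; none of these names come from the
// original code.

// Both sides of this branch are a single arithmetic instruction, so the
// device compiler may emit predicated instructions (or a select) instead of
// a real jump.
__host__ __device__ inline float short_branch(float v, bool flag)
{
    if (flag)
        v = v * 2.0f;
    else
        v = v + 1.0f;
    return v;
}

// One side of this branch holds a loop whose trip count (filter_w) is only
// known at run time, so the loop at least needs a real backward jump.
__host__ __device__ inline float loop_branch(const float* row, unsigned filter_w, bool flag)
{
    float sum = 0.0f;
    if (flag) {
        for (unsigned x = 0; x < filter_w; ++x)
            sum += row[x];   // accumulate filter_w consecutive values
    } else {
        sum = row[0];        // short alternative path
    }
    return sum;
}
```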

Have you tried the --maxrregcount option? It might start to spill values to local memory, but it is worth experimenting with just to see if the performance is acceptable.
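
As a per-kernel alternative to the file-wide --maxrregcount flag, toolkits that support the `__launch_bounds__` qualifier let you state the intended block size and leave the register budget to the compiler. A minimal sketch, assuming a hypothetical kernel `filter_kernel` with placeholder parameters and body (none of it comes from the original code). The resulting register count still has to be checked, for example by compiling with `nvcc --ptxas-options=-v`.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: name, parameters and body are placeholders.
// __launch_bounds__(512) tells the compiler this kernel is never launched
// with more than 512 threads per block, so it can limit register use
// such that a block of that size fits on one multiprocessor.
__global__ void __launch_bounds__(512)
filter_kernel(const float* in, float* out, unsigned filter_w, unsigned n)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || i + filter_w > n) return;   // stay inside the input

    float sum = 0.0f;
    for (unsigned x = 0; x < filter_w; ++x)
        sum += in[i + x];                     // placeholder accumulation
    out[i] = sum;
}
```

Unlike --maxrregcount, this only affects the annotated kernel, so other kernels in the same file keep their own register allocation.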

I just found out that setting --maxrregcount to 32 fixes the problem. But why does NVCC choose to use more registers when the limit is above 32? It isn’t spilling to local memory.

I also noticed that always executing the former else part reduces the register count to 26. Meanwhile, I’ve rewritten my code, and the problem is gone, even with --maxrregcount set to 128. I guess NVCC was being silly.

Ok, my first idea was stupid. The first for loop didn’t register in my head properly. :) That branch clearly can’t be predicated.

Have you tried commenting out just the if part of the branch? It’s possible the else branch alone is what pushes the register count up to 36.
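
One way to run this experiment without editing the source by hand is to select the branch with a template parameter and ask the runtime for each variant's register count. The sketch below assumes a hypothetical stand-in kernel `filter_variant` and hypothetical helpers `registers_used` and `report_registers`; the kernel bodies are placeholders, not the original code. `cudaFuncGetAttributes` and the `numRegs` field are part of the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the original function. The template parameter
// removes one branch at compile time, much like commenting it out.
template <bool RegisterBlock>
__global__ void filter_variant(const float* in, float* out, unsigned filter_w, unsigned n)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned span = RegisterBlock ? 4u : filter_w;
    if (i >= n || i + span > n) return;       // stay inside the input

    float sum = 0.0f;
    if (RegisterBlock) {
        // Placeholder for the register-blocked path.
        float scratch_in[4];
        for (unsigned k = 0; k < 4; ++k) scratch_in[k] = in[i + k];
        for (unsigned k = 0; k < 4; ++k) sum += scratch_in[k];
    } else {
        // Placeholder for the generic path.
        for (unsigned x = 0; x < filter_w; ++x) sum += in[i + x];
    }
    out[i] = sum;
}

// Returns the per-thread register count of a kernel, or -1 on failure.
template <typename... Args>
int registers_used(void (*kernel)(Args...))
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, kernel) != cudaSuccess) return -1;
    return attr.numRegs;
}

// Prints the register count of each variant.
void report_registers()
{
    printf("if branch only:   %d registers\n", registers_used(filter_variant<true>));
    printf("else branch only: %d registers\n", registers_used(filter_variant<false>));
}
```

Calling `report_registers()` from host code gives the per-branch numbers directly, which can then be compared with the count for the combined kernel.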

I think really understanding this will require looking at the PTX or the .cubin. Is filter_w known at compile time? Could loop unrolling be doing something here as well?

OK. I tried commenting out the if branch and the register usage becomes 26.