I’m trying to get my function to use at most 32 registers per thread so that 512 threads can run concurrently:
float scratch_in, scratch_kernel;
for (uint x = 0; x < filter_w; ++x)
sum += …
The code NVCC generates uses 36 registers. When I comment out the else branch, that drops to 29, which doesn’t make sense to me.
The registers used in the else branch should clearly be able to reuse the ones from the if branch. The else branch may extend the live ranges of some other values,
but I still don’t think it needs this many. Does anyone know how to fix this and save me from having to inspect the PTX?
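
For reference, the 32-register target comes from simple arithmetic, sketched below. It assumes a register file of 16384 32-bit registers per multiprocessor (the size on compute capability 1.2/1.3 parts; other hardware differs). The helper `regs_per_thread_budget` and the kernel `filter_sketch` are hypothetical names used only for illustration, and the kernel body is a generic 1D filter, not the function from this post. The `__launch_bounds__(512)` qualifier is one possible way to tell the compiler the intended block size so that it can try to stay within the budget; it may do so by spilling values to local memory, which can cost performance.

```cpp
// Assumption: 16384 32-bit registers per multiprocessor (compute capability 1.2/1.3).
constexpr int kRegsPerSM = 16384;
// Goal stated in the post: 512 threads resident at once.
constexpr int kThreadsWanted = 512;

// Hypothetical helper: registers available to each thread if all threads
// are to fit into the register file at the same time.
constexpr int regs_per_thread_budget(int regs_per_sm, int resident_threads) {
    return regs_per_sm / resident_threads;
}

// 16384 / 512 = 32, so 36 registers per thread is over budget and 29 is under.
static_assert(regs_per_thread_budget(kRegsPerSM, kThreadsWanted) == 32,
              "512 resident threads leave 32 registers per thread");

#ifdef __CUDACC__
// Hypothetical kernel, NOT the code from the post: a plain 1D filter.
// Assumes `in` holds at least n + filter_w - 1 elements.
// __launch_bounds__(512) promises the compiler that blocks never exceed
// 512 threads, which lets it derive a per-thread register limit.
__global__ void __launch_bounds__(512)
filter_sketch(const float* in, const float* kernel, float* out,
              unsigned n, unsigned filter_w) {
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float sum = 0.0f;
    for (unsigned x = 0; x < filter_w; ++x)
        sum += in[i + x] * kernel[x];
    out[i] = sum;
}
#endif
```

Compiling with `nvcc -Xptxas -v` prints the register count per kernel, which avoids reading the PTX by hand. A file-wide alternative to the qualifier is `nvcc -maxrregcount=32`, with the same caveat about spilling.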