MUFU.RCP provides a fast way to generate a reciprocal approximation. Generating an equally accurate approximation using only integer operations would be slower, require big(-ish) tables, or both.
When operating on integers, many variants of Newton-Raphson iteration require the error to stay strictly on one side throughout in order to deliver correct results: the estimate must be either always an overestimate or always an underestimate. If I recall correctly, MUFU.RCP delivers results within ±1 ulp of the mathematical result, so subtracting 1 ulp would ensure the starting guess is always an underestimate. In uint32_t computation, which wraps modulo 2^32, adding 0xFFFFFFFE is the same as subtracting 2.
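To make the two points above concrete, here is a minimal host-side C sketch. It is not the code from the disassembly and not any NVIDIA implementation. The helper names (add_fffffffe, nudge_down_2ulp, div_with_under_rcp) are made up for illustration, and applying the adjustment to the bit pattern of a float is my assumption about how such a constant would be used. The last function only illustrates why a one-sided reciprocal is convenient: the quotient estimate can never be too large, so the remainder cannot wrap and only an upward correction is needed.

```c
#include <stdint.h>
#include <string.h>

/* uint32_t arithmetic wraps modulo 2^32, so adding 0xFFFFFFFE
   yields the same result as subtracting 2. */
static uint32_t add_fffffffe(uint32_t x)
{
    return x + 0xFFFFFFFEu;
}

/* Reinterpret a float as its 32-bit encoding and back (no conversion). */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

static float bits_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Assumption: for a positive, normal float, decrementing the encoding
   by 2 moves the value down to the second-next smaller representable
   float, i.e. it lowers the estimate by two units in the last place. */
static float nudge_down_2ulp(float r)
{
    return bits_float(add_fffffffe(float_bits(r)));
}

/* Hypothetical illustration of the one-sided-error requirement.
   rcp must not exceed 2^32 / d (an underestimate is fine). Then
   q <= x / d, so x - q * d cannot wrap below zero, and the loop
   only ever has to correct q upward. */
static uint32_t div_with_under_rcp(uint32_t x, uint32_t d, uint32_t rcp)
{
    uint32_t q = (uint32_t)(((uint64_t)x * rcp) >> 32);
    uint32_t rem = x - q * d;
    while (rem >= d) {
        rem -= d;
        q++;
    }
    return q;
}
```

If rcp were an overestimate instead, q could come out one too large, x - q * d would wrap around to a huge unsigned value, and the correction loop above would run in the wrong direction. That is the kind of failure the one-sided starting guess avoids.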
Although I worked on multiple versions of integer division emulation for CUDA during my time at NVIDIA, I do not recall all the details, and I can neither confirm nor deny that the code shown here matches any of those versions. The HFMA2.MMA R4, -RZ, RZ, 0, 0 looks highly unusual, and I cannot figure out what it does here. It looks like some sort of compiler magic, maybe a clever way of selecting either 0x0 or 0xffffffff.
How did this question arise? Have you found a more efficient way of emulating 32-bit integer division?