Registers are not allocated one at a time: the granularity of register allocation is larger than 1. The details differ by architecture. There may also be factors other than the number of registers that lead to the limit of 384 threads.
I do not know whether one can coax the details of the occupancy computation out of Nsight Compute. However, the old Excel spreadsheet-based occupancy calculator, while deprecated, is still around and should allow you to track the exact details of the occupancy calculation, including the granularity of register allocation.
Thanks, you are correct. Looking at the Excel sheet, I found that a proper calculation has to take three architecture parameters into account: the warp allocation granularity, the register allocation unit size, and the register allocation granularity.
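
For anyone following along, here is a rough sketch of how those parameters can enter the register-limited part of the calculation. It assumes registers are allocated per warp, and every number in it (65536 registers per SM, an allocation unit of 256 registers, a warp allocation granularity of 4, and 168 registers per thread) is an illustrative assumption, not a value taken from this thread; check the spreadsheet for the values of your architecture. The function and parameter names are made up for this example.

```python
def ceil_to(value, multiple):
    """Round value up to the next multiple of `multiple`."""
    return -(-value // multiple) * multiple


def floor_to(value, multiple):
    """Round value down to the previous multiple of `multiple`."""
    return (value // multiple) * multiple


def register_limited_warps(regs_per_thread,
                           regs_per_sm=65536,          # assumed register file size
                           warp_size=32,
                           reg_alloc_unit=256,         # assumed allocation unit size
                           warp_alloc_granularity=4):  # assumed warp granularity
    """Max resident warps per SM when only registers are the limiter.

    Assumes per-warp register allocation: each warp's register demand is
    rounded up to the allocation unit, and the resulting warp count is
    rounded down to the warp allocation granularity.
    """
    regs_per_warp = ceil_to(regs_per_thread * warp_size, reg_alloc_unit)
    return floor_to(regs_per_sm // regs_per_warp, warp_alloc_granularity)


regs = 168  # hypothetical per-thread register count
print("naive thread limit:   ", 65536 // regs)                      # 390
print("with granularity:     ", register_limited_warps(regs) * 32)  # 384
```

With these assumed values, the naive division gives 390 threads, while rounding to whole warps in groups of 4 gives 12 warps, that is, 384 threads. Other limiters (shared memory, blocks per SM, maximum warps per SM) are deliberately left out of this sketch.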