The CUDA C++ Programming Guide mentions dynamic reassignment of register capacity among warp-groups to support even larger matrices. I didn’t see any special instructions for this in the PTX guide and am looking for more info on how it works. I got access to an HGX machine and will be optimizing ML models on it in the future, so this is a topic I need to familiarize myself with.