How do I handle fixed-point arithmetic with custom bit widths in CUDA?

I need to perform fixed-point arithmetic with custom bit widths, but I’m not sure how to implement this cleanly in CUDA.

Are there any intrinsics, tricks, or best practices for implementing fixed-point operations efficiently? If arbitrary widths are not practical, how can I at least handle 64-bit fixed-point arithmetic efficiently in CUDA?
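To make the 64-bit case concrete, here is one possible sketch of the operation in question. It assumes a signed Q32.32 layout (32 integer bits, 32 fractional bits) stored in an `int64_t`. The names `q32_32`, `FX_HD`, `q_from_double`, `q_to_double`, `q_mul`, and `q_mul_kernel` are hypothetical and chosen only for illustration. The device path uses the CUDA intrinsic `__mul64hi` to get the upper half of the 128-bit product. The host path assumes a compiler with `__int128` (GCC or Clang). Overflow is not handled and simply wraps.

```cpp
#include <cstdint>

#ifdef __CUDACC__
#define FX_HD __host__ __device__
#else
#define FX_HD  // allows the helpers to build as plain C++ for host-side checks
#endif

// Assumption: signed Q32.32 fixed-point value held in a 64-bit integer.
typedef int64_t q32_32;

// Convert from double; the cast truncates toward zero.
FX_HD inline q32_32 q_from_double(double x) {
    return (q32_32)(x * 4294967296.0);  // scale by 2^32
}

FX_HD inline double q_to_double(q32_32 a) {
    return (double)a / 4294967296.0;
}

// Addition and subtraction need no helper: they are ordinary integer + and -.

// Multiplication: the exact product has 64 fractional bits, so take
// bits [32, 95] of the 128-bit result to return to Q32.32.
FX_HD inline q32_32 q_mul(q32_32 a, q32_32 b) {
#ifdef __CUDA_ARCH__
    // Device path: low 64 bits from a wrapping multiply, high 64 bits
    // from the signed high-multiply intrinsic.
    uint64_t lo = (uint64_t)a * (uint64_t)b;
    int64_t  hi = __mul64hi(a, b);
    return (q32_32)(((uint64_t)hi << 32) | (lo >> 32));
#else
    // Host path: assumes __int128 support and an arithmetic right shift.
    return (q32_32)(((__int128)a * b) >> 32);
#endif
}

#ifdef __CUDACC__
// Element-wise product of two arrays of Q32.32 values.
__global__ void q_mul_kernel(const q32_32* a, const q32_32* b,
                             q32_32* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = q_mul(a[i], b[i]);
}
#endif
```

For narrower custom widths that fit in 32 bits (for example Q16.16), the same idea should be simpler: widen both operands to `int64_t`, multiply, and shift right by the number of fractional bits.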

Any insights or code snippets would be greatly appreciated!