Given that GPU registers are 32 bits wide, computation on wider integers requires representing them as arrays of 32-bit (uint32_t) chunks. One speaks of arbitrary-precision integer arithmetic when the number of chunks is freely selectable by the programmer.
Computations on large integers, whether on GPUs or CPUs, are always broken into such “limbs”, usually either of 32 bits or 64 bits depending on the register width of the hardware. A well-known and much-used library of this sort for CPUs is GMP. Integer computations are by their nature exact, unless the result becomes so large it cannot be represented in the number of bits allocated (that is, there is integer overflow).
I have not used the libraries mentioned by @cbuchner1, but I would be highly surprised if either cannot handle computation at widths of 256, 512, and 1024 bits, because integers of that size are simple enough that you could implement basic arithmetic yourself with moderate effort.
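As an indication of the "moderate effort" involved, a schoolbook 256 x 256 -> 512-bit multiply is only a few lines once the limb representation is in place. This is a hedged sketch of my own, not taken from either library:

```c
#include <stdint.h>

#define NLIMBS 8  /* 256 bits as 8 x 32-bit limbs, little-endian */

/* Schoolbook multiply: 256-bit x 256-bit -> 512-bit product.
   r must have room for 2*NLIMBS limbs and is fully overwritten. */
static void mul256(uint32_t r[2 * NLIMBS],
                   const uint32_t a[NLIMBS],
                   const uint32_t b[NLIMBS])
{
    for (int i = 0; i < 2 * NLIMBS; i++) r[i] = 0;
    for (int i = 0; i < NLIMBS; i++) {
        uint32_t carry = 0;
        for (int j = 0; j < NLIMBS; j++) {
            /* 32x32->64-bit partial product plus accumulator plus carry;
               the maximum value is 2^64 - 1, so this cannot overflow */
            uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)t;
            carry = (uint32_t)(t >> 32);
        }
        r[i + NLIMBS] = carry;
    }
}
```

For fixed widths this small, the simple quadratic algorithm is usually the right choice; asymptotically faster methods such as Karatsuba only start to pay off at substantially larger operand sizes.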
While I have programmed some wide integer multiplies and adds with CUDA in the past, I have never had a need to use an arbitrary-precision integer library. I am under the impression that long-time forum participant @cbuchner1 has practical experience with that subject matter.
In order to exploit the copious computational resources of the GPU, you will need massive parallelism (think on the order of 10K threads). Whether your use case provides that, I do not know. You may want to do a literature search and do some prototyping / benchmarking using GPU-accelerated instances provided by cloud providers such as AWS before spending significant amounts of money on your own hardware. I am aware that people doing prime number searches have found good ways of exploiting GPUs, as have people doing cryptographic computations.
Sorry, can’t help you there. I have a machine with AVX-512 support here but have not looked into targeting it; AVX2 is as far as I have gotten in terms of practical SIMD programming experience. In general I find programming with SIMD intrinsics cumbersome and CUDA’s programming model much more straightforward to use. Again, the literature may already have some performance data for your use case. Google Scholar is a useful tool for searching what is out there in terms of publications.