It might be possible. I once optimized a particular integer-only code by doing some range analysis and moving some data from integer to floating point, and got a significant speedup. So I know it’s possible in some cases. However, in my case it was a GPU that predated Volta, which did not have native 32-bit integer multiply. Strangely enough, it was indexing calculations that I moved, so even with the cost of integer->float->integer conversion folded in, there was still a benefit. You can imagine that index calculations often involve a multiply and may also be inherently easier to do range analysis on. Since Volta and beyond have “full rate” integer multiply, it might be harder to get an attractive benefit this way. I believe NVIDIA’s marketing materials more-or-less make this exact point regarding integer arithmetic on Volta. So I wouldn’t put this at the top of your list without more evidence.
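To make that concrete, here is a rough sketch of the kind of transformation I mean (kernel name and details are made up for illustration; it is only valid when range analysis guarantees the product stays below 2^24, where 32-bit float represents integers exactly):

```
// Hypothetical sketch of the index-through-float trick on pre-Volta parts.
// Only valid when range analysis guarantees row * width < 2^24, below which
// 32-bit float holds integers exactly.
__global__ void scale_elements(float *data, int width, int height, float s)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < height && col < width) {
        // Baseline: 32-bit integer multiply, emulated (slow) on pre-Volta GPUs:
        //   int idx = row * width + col;
        // Alternative: do the hot multiply in float and convert back.
        int idx = __float2int_rn(__int2float_rn(row) * __int2float_rn(width)) + col;
        data[idx] *= s;
    }
}
```

The two extra conversions are cheap relative to an emulated integer multiply on those older parts, which is why it could still come out ahead there.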
People often approach performance this way - trying to apply specific pieces of knowledge to see what will happen. I do it too. It’s not a horrible way to learn: you run experiments, then do your best to explain the results. However, without knowing what your code is limited by, this can all be very academic, or irrelevant. And taking on a large code refactoring exercise with this sort of mentality might not be the best use of your time. For a large amount of expended effort, I’d want some assurance or likelihood of a payoff at the other end.
My advice when teaching CUDA is that you should have a few (possibly just two, but maybe as many as 10 or so) basic paradigms understood, so that you “tend” to write performant code “naturally”. Everything else you leave to profiler-guided performance analysis and optimization. Make aggressive use of library implementations where possible.
I can’t think of any of the top 10 or so paradigms - except possibly the suggestion by njuffa about using 16-bit packed data to make more efficient use of memory - that would apply to what we’re talking about here.
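For context, here’s a minimal sketch of that packed-16-bit idea (my own illustration, not njuffa’s code, and it assumes the values are known to fit in 16 bits): store pairs of 16-bit values in a short2 so each 32-bit memory transaction services two elements.

```
// Hypothetical sketch: data stored as packed 16-bit integers, loaded two at a
// time through a short2. Each 32-bit load/store moves two elements, roughly
// halving memory traffic versus the same data held as 32-bit ints.
__global__ void add_offset_packed(short2 *data, int n2, short offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {               // n2 = number of short2 pairs
        short2 v = data[i];     // one 32-bit load, two 16-bit values
        v.x += offset;          // narrowing back to short; assumed not to overflow
        v.y += offset;
        data[i] = v;            // one 32-bit store
    }
}
```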
My suggestion would be to write your code in a way that seems natural, understandable, and maintainable to you, and then let the profiler guide you.
Going after compute-boundedness (mostly what’s being discussed here) without solid evidence is very often misguided, in my experience. Many people think the arithmetic in their code is what matters, when actually it is the way they use memory that is most important.
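To illustrate what I mean by “the way they use memory”, here is a hedged sketch (names made up): two kernels with identical arithmetic, one with coalesced accesses and one with strided accesses. On most GPUs the profiler will show the strided version limited by wasted memory transactions; the multiply itself is not what limits either one.

```
// Identical arithmetic, different memory access patterns.
__global__ void scale_coalesced(float *out, const float *in, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * s;   // adjacent threads touch adjacent words: coalesced
}

// Assumes the buffers hold at least n * stride elements.
__global__ void scale_strided(float *out, const float *in, int n, int stride, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        size_t j = (size_t)i * stride;
        out[j] = in[j] * s;   // adjacent threads touch words far apart: poor coalescing
    }
}
```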
Obviously I can’t speak to your specific case, YMMV, take with a grain of salt, ignore if annoying.