I have a fairly complex kernel that uses too many registers (38), so parallelism is poor: the occupancy calculator gives 1 thread block per MP and 17% occupancy.
I have reduced register usage in my code as much as I can, but it isn't enough. I looked at the PTX, and it seems to use many distinct registers where a few could be reused, since many of the operations are sequential.
So I would like to modify the PTX file by hand and recompile from it.
Is that possible?
If so, could you tell me how to resume compilation from the PTX stage, and with which options?