Speed up arithmetic-intensive kernels

I have a kernel that is dominated by compute (high arithmetic intensity, around 60 FLOPs per byte). The bottleneck pipe is XU. I have already switched to single-precision CUDA intrinsics. Is there any way to speed it up further?