Assembly/machine code for GPUs? Is it possible?

Just a thought that I’d like some input on.

Is it possible to go below C, to an assembly/machine code level, on GPUs?

I started thinking about what measures could be taken to do “extreme” optimization.
I have an inner-inner-inner-inner loop of 25 lines that takes up something like 90% or more of my entire execution time.

Find out what PTX is (download the spec) and, if that is not low-level enough for you, try decuda.
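
To make the PTX suggestion a bit more concrete, here is a minimal sketch, assuming a CUDA toolkit with nvcc installed. The names add_ptx, add_kernel, run_add and the file name example.cu are made up for illustration. It shows one way to put a single PTX instruction directly into C code with inline asm. Compiling with "nvcc -ptx example.cu -o example.ptx" should write out the PTX that the compiler generates for the whole kernel, which is probably the first thing to read before hand-tuning anything.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: adds two ints with one inline PTX instruction
// instead of the C '+' operator. "r" means a 32-bit integer register.
__device__ int add_ptx(int a, int b) {
    int c;
    asm("add.s32 %0, %1, %2;" : "=r"(c) : "r"(a), "r"(b));
    return c;
}

// Illustrative kernel: one thread per element.
__global__ void add_kernel(const int* a, const int* b, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = add_ptx(a[i], b[i]);
    }
}

// Host wrapper (an assumption for this sketch): runs the kernel on one pair
// of values and returns the result, or -1 if any CUDA call fails.
int run_add(int x, int y) {
    int *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    int result = -1;

    if (cudaMalloc(&d_a, sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_b, sizeof(int)) != cudaSuccess) { cudaFree(d_a); return -1; }
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) {
        cudaFree(d_a); cudaFree(d_b); return -1;
    }

    cudaMemcpy(d_a, &x, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &y, sizeof(int), cudaMemcpyHostToDevice);

    add_kernel<<<1, 1>>>(d_a, d_b, d_out, 1);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    if (cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess) {
        result = -1;
    }

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return result;
}

int main() {
    printf("2 + 3 via inline PTX = %d\n", run_add(2, 3));
    return 0;
}
```

Note that PTX is still an intermediate language: the driver translates it into the actual machine code for the card, so what you write in an asm statement is not necessarily what the hardware executes one-to-one.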

Thanks, I will have a look at it. :D