Just a thought that I’d like some input on.
Is it possible to move below C, to an assembly/machine-code level, on the GPUs?
I started thinking about what measures could be taken to do “extreme” optimization.
I have an inner-inner-inner-inner loop of 25 lines that takes up more than 90% of my entire execution time.