What kind of optimizations are done on PTX code?

I am auto-generating some PTX code, which I plan to load using cuModuleLoadDataEx.
I am wondering: what kind of optimizations are done by the PTX → cubin compiler/assembler?
Where do I find information about this?

My concern right now is: Do I need to care about my register usage or does the assembler optimize that anyway?

BR Troels

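In my experience, the optimizer does perform its own register allocation on PTX, so the registers you declare in the generated code are not mapped one-to-one to hardware registers.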