If you want to go the NVIDIA-supported route, you can use the driver-level API to load PTX files manually. See around page 133 of the CUDA Reference Manual, specifically the cuModuleLoadDataEx function. This lets you set the optimization level, target architecture, etc. You won't be able to use the CUDA runtime API, though.
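A minimal sketch of what that looks like (the PTX string, kernel name `myKernel`, and the particular JIT option values are placeholders, not anything from your code; this needs a CUDA-capable GPU and the driver API headers to actually run):

```c
#include <stdio.h>
#include <cuda.h>

/* Placeholder: in practice you would read the PTX text from a .ptx file
 * or embed the string produced by nvcc --ptx. */
static const char *ptx_source = "/* ...PTX text... */";

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* JIT options: optimization level and target architecture.
     * The values here (O4, compute 1.3) are just examples. */
    CUjit_option opts[] = { CU_JIT_OPTIMIZATION_LEVEL, CU_JIT_TARGET };
    void *vals[]        = { (void *)(size_t)4,
                            (void *)(size_t)CU_TARGET_COMPUTE_13 };

    /* JIT-compile the PTX against the options above. */
    cuModuleLoadDataEx(&mod, ptx_source, 2, opts, vals);

    /* "myKernel" is a hypothetical entry-point name from the PTX. */
    cuModuleGetFunction(&fn, mod, "myKernel");

    /* ...set up arguments and launch the kernel via the driver API... */

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```

Note that error checking is omitted for brevity; every cu* call returns a CUresult you should check in real code.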
Alternatively, you can use the ptxas tool, the PTX assembler, which performs the same translation offline: it converts a .ptx file into a .cubin file. Even if you go this route, you will still need the CUDA driver API to load the .cubin files.
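The offline variant might look like this (file names, the sm_13 target, and the kernel name are placeholders; loading a .cubin uses cuModuleLoad rather than the JIT path, since the code is already compiled):

```c
#include <cuda.h>

int main(void)
{
    /* Compile the PTX offline first, e.g.:
     *   ptxas -O3 -arch=sm_13 kernel.ptx -o kernel.cubin
     * (file names and architecture are examples). */
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* Load the precompiled binary; no JIT options apply here,
     * since ptxas already fixed the optimization level and target. */
    cuModuleLoad(&mod, "kernel.cubin");

    /* "myKernel" is a hypothetical entry-point name. */
    cuModuleGetFunction(&fn, mod, "myKernel");

    /* ...launch via the driver API as before... */

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```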
If you want more control over the optimizations that are applied, you can write an optimization pass using Ocelot http://code.google.com/p/gpuocelot/. Ocelot also has an interface for loading a kernel from inlined PTX or an input file and calling it directly through the CUDA runtime API. Going this route requires compiling and linking against Ocelot.