I think one of the key problems at the moment is the lack of function pointer/subroutine support in kernels. It looks like texture access gets translated into inline assembler during compilation, so everything needed to make the texture access work at kernel launch has to be visible to the compiler in the same compilation unit. If it weren’t inlined, it might be possible to leave a dangling symbol and have the driver match everything up JIT at runtime, something like the way a modern shared library runtime linker works. That has side effects, though. Program launch times could be much longer than they are now, especially for complex applications. You would also get a new situation where a CUDA app that compiles without error fails to run and comes back with a bunch of symbol or object errors, which in many ways is a harder and more complex set of problems to debug than what we have now. It would also add functionality, complexity, and overhead to the driver, which is already a large and complex piece of code.
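To make the runtime linker comparison a bit more concrete, here is a rough host-side sketch in plain C of the "compiles fine, fails at run time" situation. It is only an analogy: it uses the POSIX `dlopen`/`dlsym` calls, not anything in the CUDA driver, and the helper `resolve_symbol` and the symbol name `hypothetical_texture_fetch_v2` are made up for illustration. On older glibc versions you may need to link with `-ldl`.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: try to resolve `name` at run time in the running
 * program and its loaded shared libraries.
 * Returns 1 if the symbol was found, 0 otherwise. */
static int resolve_symbol(const char *name)
{
    /* dlopen(NULL, ...) gives a handle for the main program itself. */
    void *self = dlopen(NULL, RTLD_LAZY);
    if (self == NULL) {
        return 0;
    }

    dlerror();                       /* clear any stale error state */
    void *addr = dlsym(self, name);  /* the actual late-binding lookup */

    if (addr == NULL) {
        /* The build succeeded, but the failure only shows up here. */
        const char *msg = dlerror();
        fprintf(stderr, "unresolved symbol '%s': %s\n",
                name, msg ? msg : "(no detail)");
    }

    dlclose(self);
    return addr != NULL;
}
```

Calling `resolve_symbol("malloc")` should succeed in a normally linked program, while `resolve_symbol("hypothetical_texture_fetch_v2")` compiles without complaint and only fails once the lookup actually runs. That is the kind of error I would expect to start seeing if the driver did this sort of matching for kernels.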
With the arrival of Fermi, it will be interesting to see how the toolchain develops, but as things stand now I don’t see how it could be done.