Reducing binary size while using accelerated libs

It sounds like you’re asking for prune functionality to be built into nvcc. I’m not aware of any nvcc option that would accomplish what you have done here.

If you’d like to see this behavior changed in CUDA, my suggestion would be to file a bug. I don’t know of any specific reason why prune functionality could not be built into nvcc, although there may be reasons I haven’t thought of. As you’ve already discovered, it may not be a simple matter of matching the user’s specified arch switches.
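
For anyone landing here later, this is a rough sketch of what a separate prune step might look like when scripted by hand. I'm assuming the pruning is done with the `nvprune` utility from the CUDA toolkit and its `-gencode` and `-o` options. The helper name `nvprune_command`, the archive names, and the architecture list are placeholders of mine, not taken from your build. It only builds the command line and does not run anything.

```python
import shlex


def nvprune_command(archive, output, sm_archs):
    """Build (but do not run) an nvprune command line.

    Hypothetical helper: one -gencode entry is emitted per requested
    sm_XY target, mirroring the arch switches one might pass to nvcc.
    """
    if not sm_archs:
        raise ValueError("at least one architecture is required")
    cmd = ["nvprune"]
    for sm in sm_archs:
        # Assumption: keeping only sm_XY code is what the application needs.
        cmd += ["-gencode", f"arch=compute_{sm},code=sm_{sm}"]
    # nvprune takes the output file via -o and the input archive last.
    cmd += ["-o", output, archive]
    return cmd


# Placeholder file names and architectures, for illustration only.
cmd = nvprune_command("libcublas_static.a", "libcublas_static_pruned.a", ["70", "80"])
print(shlex.join(cmd))
```

Whether a one-to-one mapping from the nvcc arch switches like this is actually sufficient for a given library is exactly the open question, so treat the result as a starting point to test, not something known to work.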