The size of the per-SM instruction cache can be determined through a microbenchmark that uses a loop of increasing size: there is a small but measurable drop in execution speed once the loop body exceeds the Icache size. I performed such an experiment in the past, and from my recollection the Icache size was 4 KB, but I don't recall which part I measured on (most likely a K20), and the size may well differ between architectures.
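The usual way to run such an experiment is to generate a family of kernels whose unrolled loop bodies grow in steps, compile each one, and time the loop; throughput drops once the body no longer fits in the Icache. Below is a minimal host-side sketch of the generator part only. The kernel name, the FMA-style filler statement, and the iteration count are all hypothetical choices, not taken from any specific published benchmark.

```python
def make_icache_probe(n_ops: int, iters: int = 1000) -> str:
    """Generate CUDA source for a kernel whose loop body contains
    n_ops dependent FMA-style statements (each compiles to a small,
    fixed number of instructions), so the body size scales with n_ops."""
    body = "\n".join(f"        x = x * a + b;  // filler op {i}"
                     for i in range(n_ops))
    return (
        "__global__ void icache_probe(float *out, float a, float b) {\n"
        "    float x = out[threadIdx.x];\n"
        f"    for (int i = 0; i < {iters}; ++i) {{\n"
        f"{body}\n"
        "    }\n"
        "    out[threadIdx.x] = x;\n"
        "}\n"
    )

# Each variant would then be compiled (e.g. with nvcc) and timed;
# plotting time-per-op against n_ops should show a knee at the
# Icache capacity.
src = make_icache_probe(4)
```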
GPU instructions are generally 8 bytes long, and for the Maxwell (sm_5x) architecture one can easily see from a disassembled binary that an additional 8 bytes of control information are added for every three actual instructions. So a 4 KB Icache would hold 384 instructions on an sm_5x part. Given the compiler's aggressive inlining, loop bodies in various real-life scenarios can exceed this size. In my (pre-Maxwell) experience, the performance penalty on a loop that exceeds the Icache was never larger than about 3%.
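The arithmetic behind that 384-instruction figure is straightforward: on sm_5x, every group of three 8-byte instructions carries one extra 8-byte control word, so each 32-byte group holds three instructions.

```python
ICACHE_BYTES = 4 * 1024          # assumed 4 KB Icache, per the measurement above
INSN_BYTES = 8                   # one instruction
CTRL_BYTES = 8                   # one control word per group of three (sm_5x)
GROUP_BYTES = 3 * INSN_BYTES + CTRL_BYTES   # 32 bytes per group

instructions_in_icache = (ICACHE_BYTES // GROUP_BYTES) * 3   # -> 384
```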
So I personally do not worry about Icache misses. As with other stall events on the GPU, a large number of threads running with zero-overhead context switching are generally able to cover the latency well.
It is unclear what kind of trade-offs between Icache and Dcache you have in mind. Something like switch statements versus function pointers? Recomputation versus lookup tables? There are other mechanisms that affect those decisions and likely have higher impact, such as branch divergence and serialization.
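To make the recomputation-versus-lookup-table trade-off concrete, here is a minimal host-side sketch (the function names and the sine example are my own illustration, not from any particular code base). The table variant trades instruction footprint for data traffic; the recompute variant does the opposite. On a GPU, which one wins usually depends far more on memory access patterns and divergence than on Icache pressure.

```python
import math

TABLE_SIZE = 256  # arbitrary illustrative size

# Lookup-table variant: small code footprint, but the table occupies
# data cache / memory bandwidth.
sin_table = [math.sin(2.0 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]

def sin_lookup(i: int) -> float:
    return sin_table[i % TABLE_SIZE]

# Recompute variant: no table to fetch, but more instructions executed
# (and thus more Icache footprint if inlined widely).
def sin_recompute(i: int) -> float:
    return math.sin(2.0 * math.pi * (i % TABLE_SIZE) / TABLE_SIZE)
```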