Optimizing for many concurrent kernels

I am optimizing a kernel that I expect will run fairly quickly. However, I also plan to have many invocations of this kernel that will run concurrently on the GPU (probably via streams). I will likely also have other fast kernels running concurrently with this kernel.
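For concreteness, a minimal sketch of the launch pattern I have in mind is below; the kernel body, buffer sizes, and stream count are placeholders rather than my actual code:

```cpp
#include <cuda_runtime.h>

// Placeholder for the real, fast kernel.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int NUM_STREAMS = 8;   // illustrative
    const int N = 1 << 16;       // illustrative
    cudaStream_t streams[NUM_STREAMS];
    float *buffers[NUM_STREAMS];

    for (int s = 0; s < NUM_STREAMS; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buffers[s], N * sizeof(float));
    }

    // One launch per stream: the invocations are eligible to run
    // concurrently whenever the GPU has free resources.
    for (int s = 0; s < NUM_STREAMS; ++s) {
        myKernel<<<(N + 255) / 256, 256, 0, streams[s]>>>(buffers[s], N);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < NUM_STREAMS; ++s) {
        cudaFree(buffers[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```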

My question is: should I still approach the optimization with the goal of making a single invocation of the kernel finish as quickly as possible? Or could I sacrifice the speed of a single invocation so that multiple concurrent invocations finish faster overall? For example, I could do more work per thread and launch fewer blocks, even if this makes a single invocation slower. Or I could use fewer registers or less shared memory. Also, is occupancy a factor here?
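To illustrate the "more work per thread, fewer blocks" option, something like the following grid-stride variant is what I mean (the kernel and launch numbers are only illustrative):

```cpp
// Grid-stride version: each thread processes many elements, so the kernel
// can be launched with far fewer blocks than ceil(n / blockDim.x).
__global__ void myKernelStrided(float *data, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {
        data[i] *= 2.0f;   // placeholder for the real work
    }
}

// A small grid may make one invocation slower but leaves SMs free for the
// other concurrent kernels:
//   myKernelStrided<<<32, 256, 0, stream>>>(d_data, n);              // few blocks
//   myKernelStrided<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n); // many blocks
```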

Yes to all of it. It depends on your application requirements (e.g. whether latency matters or only throughput), on your GPU, on your kernel's resource requirements (including its memory working-set size, which affects cache hit rate), and so on. You should profile your kernel runs, try out different launch configurations, and work out the theoretical optimums and bottlenecks.
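As a starting point for that kind of experimentation, a sketch like the one below (with `myKernel` as a placeholder for your own kernel) uses the occupancy API to see how register and shared-memory usage limit resident blocks per SM for a given block size:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; substitute your own.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void reportOccupancy(int blockSize) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // How many blocks of myKernel can be resident on one SM at this block
    // size, given its register and shared-memory usage.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  blockSize, 0 /* dynamic smem */);

    float occupancy = (float)(blocksPerSM * blockSize) /
                      prop.maxThreadsPerMultiProcessor;
    printf("block size %4d: %d blocks/SM, %.0f%% occupancy\n",
           blockSize, blocksPerSM, occupancy * 100.0f);
}

int main() {
    for (int bs = 64; bs <= 1024; bs *= 2) reportOccupancy(bs);
    return 0;
}
```

Lower occupancy is not automatically bad for a bandwidth-bound kernel, but numbers like these, combined with profiler timelines of the concurrent launches, tell you where the real limit is.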