If you mean the overall cycles during which actual work was done by the process, vs. the maximum number of cycles available during the execution time, I agree.
Instructions might be a bit misleading, since counting them would mean you would not be able to achieve 100% compute utilization with multi-cycle instructions.
I think, as so often, the answer to your question is: It depends.
If you have no divergent branches, the compute utilization should converge to 100%.
In a kernel with one divergent branch, the theoretical compute utilization will be 50%, if I remember correctly, because the profiler assumes each path is taken half the time. In reality you will sometimes see a higher compute utilization there, because real data tends to be similar in consecutive memory regions, so full warps often end up taking the same branch.
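To illustrate what I mean (a minimal sketch; the kernel name and the actual work in each path are made up), here is a data-dependent branch that a profiler would model as 50/50 divergent, while coherent input data, e.g. long runs of same-sign values, lets whole warps take the same path:

```
// Hypothetical kernel: the branch only costs you utilization when a
// single warp mixes both cases. With coherent input (long runs of
// positive or negative values), whole warps take the same path and
// the measured compute utilization ends up above the 50% the
// profiler's 50/50 assumption would predict.
__global__ void branchy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] >= 0.0f)           // divergent only if a warp mixes signs
        out[i] = in[i] * 2.0f;   // path A
    else
        out[i] = in[i] * -1.0f;  // path B
}
```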
The major problem with saying what a good compute utilization is, is that it most likely depends on the type of kernel you are building and the amount of effort you are putting into it. Some applications might not allow writing a kernel that processes efficiently on a GPU. Other applications can become faster if you optimize the processing flow of your data.
Additionally, if you have a task that is very load/store intensive, while you do little arithmetic with the data you read, your compute utilization won't look very good either (see the sketch below). If you do not have to do anything else with the data, then it is okay; it's just a limit you will need to accept.
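As a hedged example of such a memory-bound case (again just a sketch, the kernel is hypothetical), a kernel like this does one multiply per 8 bytes of memory traffic, so it is limited by memory bandwidth and its compute utilization will stay low no matter how well you tune it:

```
// Hypothetical load/store-bound kernel: almost no arithmetic per byte
// moved. The bottleneck is memory bandwidth, not the compute units, so
// a low compute utilization here is expected, not a sign of a badly
// written kernel.
__global__ void scale(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;  // one multiply per load + store pair
}
```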
I would suggest critically reviewing the compute utilization percentage the profiler shows you, and always comparing it to what you would expect from your specific application.