The GPU generally likes to have full occupancy. One of the most important reasons for this is latency hiding. You can find many descriptions of it in various forum write-ups, and it is covered in an organized fashion in unit 3 of this online tutorial series.
Briefly, just because the SM has, say, 24 warps does not mean that in every clock cycle it can find a warp that is ready to issue a new instruction. But the SM (or SMSP) is operating at highest throughput when it can issue an instruction in every clock cycle. Warps can be stalled for various reasons, the most common being that the warp is waiting for a dependency to be satisfied before its threads can continue executing.
If you have 24 warps assigned to an SM, then the SM has up to 24 warps to choose from. If you have 48 warps, then it has up to 48 warps to choose from. (The details are more complicated than this, because a cc8.9 SM is really broken into 4 SMSPs, but for a general understanding of latency hiding we can consider things at an aggregate level.)
If you have 48 warps to choose from, then in some cases it will be “less likely” (as compared to the 24 warp case) that in a given clock cycle there are no “eligible warps”. When that is so, an instruction gets issued in more cycles, and average code throughput could be higher.
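To put a rough number on “less likely”, here is a minimal sketch of a toy model. It assumes every warp is stalled independently with the same probability in any given cycle, which real code does not obey, and the function names are made up for illustration. Under that assumption, the chance that no warp is eligible is just the stall probability raised to the number of warps.

```cpp
#include <cmath>

// Toy model only: assumes each warp is stalled independently with
// probability p_stalled in any given cycle. Real stalls are correlated,
// so treat the result as an intuition aid, not a prediction.

// Probability that none of `warps` warps is eligible in a cycle.
double prob_no_eligible_warp(int warps, double p_stalled) {
    return std::pow(p_stalled, warps);
}

// Fraction of cycles in which at least one warp can issue.
double issue_fraction(int warps, double p_stalled) {
    return 1.0 - prob_no_eligible_warp(warps, p_stalled);
}
```

For example, if 24 and 48 warps per SM are spread evenly over 4 SMSPs (6 and 12 warps per scheduler) and each warp is stalled 90% of the time, this model gives an issue fraction of about 0.47 with 6 warps and about 0.72 with 12.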
Again, this is not a blanket guarantee. Higher occupancy is somewhat correlated with higher performance, but the correlation is not 100%. I can’t say whether a block size of 64 would actually help your code from a performance perspective.
That isn’t the definition of occupancy, at least not the way NVIDIA tools use the word. Occupancy is the number of threads actually resident on an SM compared to the number of threads that could be resident on an SM. Even this can be looked at in a couple of different ways, but neither corresponds to your statement. Your statement ("all the 32 SIMT units have something to do") is closer to the notion of utilization, as presented by Nsight Compute, e.g. in the SOL report section. The two bars that Nsight Compute presents in the bar chart in that section refer to SM utilization and memory utilization. SM utilization is roughly similar, in my view, to your statement, although we could descend into another discussion there, because a GPU SM does not consist of or contain “32 SIMT units”. Nevertheless, the notion you have expressed is, in my view, related to utilization. And latency hiding is closely connected to utilization, and is reflected in the SM utilization reported by Nsight Compute.
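As a worked example of the occupancy definition, here is a minimal sketch that computes theoretical occupancy from block size alone. The per-SM limits (48 resident warps, 24 resident blocks) are what I believe apply to cc8.9, so treat them as assumptions and check the CUDA Programming Guide for your device. The sketch also deliberately ignores register and shared-memory limits, and it counts in warps rather than threads. The names are hypothetical.

```cpp
#include <algorithm>

// Assumed per-SM limits for a cc8.9 device; verify for your GPU.
constexpr int kWarpSize = 32;
constexpr int kMaxWarpsPerSM = 48;   // i.e. 1536 resident threads
constexpr int kMaxBlocksPerSM = 24;

// Theoretical occupancy from block size only. Register and shared-memory
// usage can lower the real figure; they are ignored here.
double theoretical_occupancy(int threads_per_block) {
    // A partially filled warp still occupies a whole warp slot.
    int warps_per_block = (threads_per_block + kWarpSize - 1) / kWarpSize;
    int blocks_by_warps = kMaxWarpsPerSM / warps_per_block;
    int resident_blocks = std::min(kMaxBlocksPerSM, blocks_by_warps);
    int resident_warps = resident_blocks * warps_per_block;
    return static_cast<double>(resident_warps) / kMaxWarpsPerSM;
}
```

Under these assumed limits, a block size of 32 is capped by the block limit at 24 resident warps (occupancy 0.5), while a block size of 64 reaches 48 resident warps (occupancy 1.0). On a real device, the CUDA occupancy API or Nsight Compute will give you the figure that accounts for all limiters.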
Utilization is considered in aggregate, but taking into account cycle-by-cycle behavior. That is, if you can issue an instruction in the SM 50% of the time, then the utilization will be reported at approximately 50%. And yes, we could dissect that statement as well.
But the point I want to make is, if you know for certain that your utilization is at 100% (that is, in every clock cycle, in every SMSP of every SM, there is at least 1 eligible warp), then in my view simply increasing occupancy is unlikely to result in significant performance benefit. The most obvious path for increased occupancy to result in increased performance is if there is a corresponding increase in SM utilization (or perhaps memory utilization).
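The saturation point can be illustrated with a minimal sketch of a toy scheduler. It assumes one issue slot per cycle and a fixed number of cycles before a warp that has just issued becomes eligible again. Real schedulers and real latencies are far less regular, and the function name is made up for illustration.

```cpp
#include <vector>

// Toy single-issue scheduler: at most one instruction issues per cycle,
// and a warp that issues is ineligible for the next `latency` cycles.
// Returns the fraction of cycles in which an instruction was issued.
double simulated_issue_utilization(int warps, int latency, int cycles) {
    std::vector<int> ready_at(warps, 0);  // first cycle each warp is eligible
    int issued = 0;
    for (int cycle = 0; cycle < cycles; ++cycle) {
        for (int w = 0; w < warps; ++w) {
            if (ready_at[w] <= cycle) {       // found an eligible warp
                ready_at[w] = cycle + latency;
                ++issued;
                break;                        // only one issue per cycle
            }
        }
    }
    return static_cast<double>(issued) / cycles;
}
```

With a latency of 12 cycles, 6 warps keep the issue slot busy half the time and 12 warps keep it busy every cycle. Going from 12 to 24 warps doubles occupancy in this model but leaves utilization at 100%, so there is nothing left for the extra warps to hide.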