There is no way to “force” this in the general case. For example, if you happen to run your code on a GPU with only 1 SM, then there is no way to have all 42 thread blocks resident and active at the same time. Other factors that you haven’t specified, such as the desired threads per block, may also preclude your goal on other GPUs.
Since we have now dispensed with the general case, this becomes a matter of occupancy. You might want to learn about:
- What occupancy is
- How to use the occupancy calculator spreadsheet
- How to use the occupancy API - a good example is given in the programming guide under the cooperative groups section
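Occupancy is discussed in many places; Google will help you locate more references.
Your goal would be to use the tools (static - occupancy calculator spreadsheet, dynamic - occupancy API) to construct a kernel launch that does not exceed 100% occupancy, i.e. one whose total number of blocks is no larger than the number of blocks that can be resident per SM multiplied by the number of SMs on the GPU.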
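As a rough sketch of the dynamic approach, the following queries the occupancy API at runtime and checks whether 42 blocks could all be resident at once on the current GPU. The kernel `myKernel`, the helper `allBlocksResident`, and the choice of 128 threads per block are assumptions made up for illustration; substitute your own kernel and launch configuration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel (not from the question); use your own.
__global__ void myKernel(float *data)
{
    data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
}

// Host-side check: can numBlocks all be resident simultaneously, given how
// many blocks fit on one SM and how many SMs the GPU has?
bool allBlocksResident(int numBlocks, int blocksPerSm, int numSms)
{
    return numBlocks <= blocksPerSm * numSms;
}

int main()
{
    const int numBlocks = 42;         // block count from the question
    const int threadsPerBlock = 128;  // assumed; not specified in the question

    int device = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&device) != cudaSuccess ||
        cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        printf("No usable CUDA device found\n");
        return 1;
    }

    // Ask the runtime how many blocks of this kernel fit on one SM for the
    // given block size and dynamic shared memory usage (0 bytes here).
    int blocksPerSm = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSm, myKernel, threadsPerBlock, 0);
    if (err != cudaSuccess) {
        printf("Occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("%d SMs, %d resident blocks per SM, %d blocks requested\n",
           prop.multiProcessorCount, blocksPerSm, numBlocks);

    if (!allBlocksResident(numBlocks, blocksPerSm, prop.multiProcessorCount)) {
        printf("All %d blocks cannot be resident at once on this GPU\n", numBlocks);
        return 0;
    }

    // The launch fits within the occupancy limit, so go ahead.
    float *data = nullptr;
    if (cudaMalloc(&data, numBlocks * threadsPerBlock * sizeof(float)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(data, 0, numBlocks * threadsPerBlock * sizeof(float));
    myKernel<<<numBlocks, threadsPerBlock>>>(data);
    err = cudaDeviceSynchronize();
    printf("Kernel finished: %s\n", cudaGetErrorString(err));
    cudaFree(data);
    return 0;
}
```

Note that this check only tells you whether the blocks can fit; if your algorithm actually depends on all blocks being co-resident (for example, for a grid-wide sync), the cooperative launch mechanism described in the cooperative groups section of the programming guide is the place to look.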