If you want to do this level of tuning, a threadblock can query its SM ID and organize its work accordingly. For example, if you launch a kernel with enough threadblocks to fill the GPU, each threadblock can query its SM ID and then choose which portion of the work to do based on that SM ID, rather than on the built-in threadblock variables such as `blockIdx`.
This would allow you, for example, to arrange the work so that multiple threadblocks on the same SM benefit from that SM's L1 cache. If you have multiple threadblocks per SM, you still need to sort out which block is which, perhaps using a per-SM atomic like what striker159 suggested.
You can query the SM ID by reading the PTX special register `%smid` through inline assembly.
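A minimal sketch of such a query follows. The helper names (`get_smid`, `get_nsmid`, `record_smid`, `count_out_of_range_smids`) are made up for illustration. Note that PTX documents `%smid` as volatile: it reports where the thread is running at the moment it is read, its numbering is not guaranteed to be contiguous, and the only range guarantee is that it is less than `%nsmid`.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Read the PTX special register %smid (the SM this thread is running on).
__device__ __forceinline__ unsigned int get_smid()
{
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Read %nsmid, the upper bound on %smid values (may exceed the physical SM count).
__device__ __forceinline__ unsigned int get_nsmid()
{
    unsigned int nsmid;
    asm volatile("mov.u32 %0, %%nsmid;" : "=r"(nsmid));
    return nsmid;
}

// One thread per block records the SM ID and the %nsmid bound it observed.
__global__ void record_smid(unsigned int *smid_out, unsigned int *nsmid_out)
{
    if (threadIdx.x == 0) {
        smid_out[blockIdx.x]  = get_smid();
        nsmid_out[blockIdx.x] = get_nsmid();
    }
}

// Hypothetical host helper: launches one block per SM, prints each block's SM ID,
// and returns how many blocks reported smid >= nsmid (-1 on a CUDA error).
int count_out_of_range_smids()
{
    int dev = 0, num_sms = 0;
    if (cudaGetDevice(&dev) != cudaSuccess) return -1;
    if (cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, dev) != cudaSuccess) return -1;

    unsigned int *d_smid = nullptr, *d_nsmid = nullptr;
    if (cudaMalloc(&d_smid, num_sms * sizeof(unsigned int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_nsmid, num_sms * sizeof(unsigned int)) != cudaSuccess) { cudaFree(d_smid); return -1; }

    record_smid<<<num_sms, 32>>>(d_smid, d_nsmid);

    std::vector<unsigned int> smid(num_sms), nsmid(num_sms);
    cudaError_t e1 = cudaMemcpy(smid.data(), d_smid, num_sms * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(nsmid.data(), d_nsmid, num_sms * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_smid);
    cudaFree(d_nsmid);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return -1;

    int bad = 0;
    for (int b = 0; b < num_sms; ++b) {
        printf("block %d ran on SM %u\n", b, smid[b]);
        if (smid[b] >= nsmid[b]) ++bad;
    }
    return bad;
}

int main()
{
    return count_out_of_range_smids() == 0 ? 0 : 1;
}
```

The printed mapping of blocks to SMs depends on the GPU and the block scheduler, so it should not be assumed to be one block per SM or to be repeatable between runs.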
Here is an example of code that does SM specialization and allows for multiple blocks per SM.
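As a rough illustration of the general idea (a sketch under stated assumptions, not the example referenced above): the data is split into one partition per SM, and every block pulls chunks from its "home" partition, chosen from its SM ID, through a per-SM atomic counter. That lets any number of blocks on the same SM share a partition. Because block placement is not guaranteed to be even, each block falls through to the other partitions once its own is drained; that fall-through is an addition here so that all work completes regardless of scheduling. All names and sizes (`sm_partitioned_increment`, `next_chunk`, `chunk_size`, and so on) are hypothetical.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__device__ __forceinline__ unsigned int get_smid()
{
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Each block starts on the partition selected by its SM ID and claims chunks
// from that partition's atomic counter; afterwards it helps with the others.
__global__ void sm_partitioned_increment(int *data, int n, int *next_chunk,
                                         int num_parts, int chunks_per_part, int chunk_size)
{
    __shared__ int home;   // partition chosen from the SM ID
    __shared__ int chunk;  // chunk index claimed by thread 0
    if (threadIdx.x == 0) home = get_smid() % num_parts;  // modulo: %smid may exceed the SM count
    __syncthreads();

    for (int p = 0; p < num_parts; ++p) {
        int part = (home + p) % num_parts;
        while (true) {
            if (threadIdx.x == 0) chunk = atomicAdd(&next_chunk[part], 1);
            __syncthreads();
            int c = chunk;
            __syncthreads();  // everyone has read chunk before thread 0 overwrites it
            if (c >= chunks_per_part) break;  // partition drained (uniform across the block)
            int base = (part * chunks_per_part + c) * chunk_size;
            for (int i = threadIdx.x; i < chunk_size; i += blockDim.x) {
                int idx = base + i;
                if (idx < n) data[idx] += 1;  // chunk is owned by this block, no atomics needed
            }
        }
    }
}

// Hypothetical host driver: returns the number of elements not incremented
// exactly once (0 means every chunk was processed once), or -1 on a CUDA error.
int run_sm_partitioned_demo()
{
    int dev = 0, num_sms = 0;
    if (cudaGetDevice(&dev) != cudaSuccess) return -1;
    if (cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, dev) != cudaSuccess) return -1;

    const int n = 1 << 20, chunk_size = 1024, threads = 256, blocks_per_sm = 2;
    const int chunks_per_part = (n + num_sms * chunk_size - 1) / (num_sms * chunk_size);

    int *d_data = nullptr, *d_next = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_next, num_sms * sizeof(int)) != cudaSuccess) { cudaFree(d_data); return -1; }
    cudaMemset(d_data, 0, n * sizeof(int));
    cudaMemset(d_next, 0, num_sms * sizeof(int));

    sm_partitioned_increment<<<num_sms * blocks_per_sm, threads>>>(
        d_data, n, d_next, num_sms, chunks_per_part, chunk_size);

    std::vector<int> h(n);
    cudaError_t err = cudaMemcpy(h.data(), d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_next);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int v : h) wrong += (v != 1);
    return wrong;
}

int main()
{
    int wrong = run_sm_partitioned_demo();
    printf("elements not processed exactly once: %d\n", wrong);
    return wrong == 0 ? 0 : 1;
}
```

Whether this kind of arrangement actually improves L1 hit rates depends on the access pattern and the architecture, so it is worth profiling against a plain `blockIdx`-based version before committing to it.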
There are a variety of other code examples that do various things based on the SM ID. Here are some examples: 1 2
I haven’t read the paper you indicated.