Question: do you know all of the pairs beforehand?
Could you pre-calculate all of the pairs?
If so, why not just create a 2-D vector where one dimension indexes the pairs and the other holds the two values of each pair? That way you end up with a single array that you pass to the GPU, and every thread or block (or whatever granularity you end up needing) operates on one pair.
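To make the pre-calculation idea concrete, here is a minimal host-side sketch in C++. It assumes the pairs are all unordered two-element combinations of n items, which is my guess and not something the question states. The `Pair` struct and the `build_pairs` function are hypothetical names made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical pair record: two item indices stored side by side,
// so every entry in the list has the same layout.
struct Pair {
    int first;
    int second;
};

// Build every unordered pair (i, j) with i > j from n items.
// Assumption: the pairs are all 2-element combinations of n items.
std::vector<Pair> build_pairs(int n) {
    std::vector<Pair> pairs;
    if (n < 2) return pairs;  // fewer than two items: no pairs
    pairs.reserve(static_cast<std::size_t>(n) * (n - 1) / 2);
    for (int i = 1; i < n; ++i) {
        for (int j = 0; j < i; ++j) {
            pairs.push_back({i, j});
        }
    }
    return pairs;
}
```

Because `std::vector<Pair>` is contiguous, `pairs.data()` together with `pairs.size() * sizeof(Pair)` describes one flat buffer, which could then be copied to the device in a single transfer (for example with `cudaMemcpy`).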
I don’t think mapping a thread/block id to a specific pair of numbers is really the idea here. Whether thread 0 works on 4,3 and thread 1 works on 2,0, or any other order like that, doesn’t matter. It shouldn’t matter which thread or block works on which pair, just that all pairs are laid out the same way (data-wise).
So in this situation you would end up with, say, 10 threads working on these 10 pairs. It doesn’t matter which thread is working on which pair, just that each one is working on its own pair. If you bumped it up to, say, 15 or 21 pairs, it’s the same concept.
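The "one worker per pair, order doesn't matter" point can be sketched on the host in C++, with a plain loop standing in for GPU threads. Everything here is illustrative: `process_pair`, `run_all`, and the sum used as the per-pair work are placeholders I made up, not anything from the question. In a real kernel, `tid` would come from the thread and block indices.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical pair record, same layout for every entry.
struct Pair {
    int first;
    int second;
};

// Stand-in for the per-thread work: a worker only reads and writes
// the slot that matches its own id. The sum is a placeholder operation.
void process_pair(const std::vector<Pair>& pairs, std::vector<int>& out,
                  std::size_t tid) {
    if (tid >= pairs.size()) return;  // surplus workers do nothing
    out[tid] = pairs[tid].first + pairs[tid].second;
}

// Run the workers in whatever order the caller supplies.
std::vector<int> run_all(const std::vector<Pair>& pairs,
                         const std::vector<std::size_t>& order) {
    std::vector<int> out(pairs.size(), 0);
    for (std::size_t tid : order) {
        process_pair(pairs, out, tid);
    }
    return out;
}
```

Calling `run_all` on the same pair list with the ids in a different order gives the same output, because no worker depends on any slot but its own. The bounds check is what lets you launch more workers than pairs (for example when rounding up to a block size) without harm.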
The bigger problem you are trying to solve is making sure that all of the pairs have the same structure and can be accessed the same way (i.e. inside the thread, the code shouldn’t care which pair it is working on). That way you get the parallelization you are looking for.
At least that is my interpretation of your question.