I have a use case in my algorithm for transposing a 32x32 bit matrix within a warp. Each thread holds a uint32 value v containing one bit destined for each thread in the warp. Thread k needs to receive a uint32 whose bit j is bit k of the v held by thread j. It's not unlike __ballot_sync(), except that instead of treating the parameter as a boolean, thread 5 (for example) would receive a packed uint32 containing bit 5 from every thread's v, and so on.
I'm currently doing this with five sequential __shfl_xor_sync() calls, plus a ton of shifting and masking. But it struck me that it should be possible (and highly useful IMO) to implement such an operation in hardware, without much more complexity than the current warp shuffle and ballot mechanisms. For a future architecture, does this sound like something that might be theoretically possible? If it were implemented, it could also have natural variations related to warp size; e.g. with a warp size of 16, each thread would send 2 bits to each other thread, and so on. Curious what you think.
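For reference, here is a minimal sketch of how a five-step __shfl_xor_sync() transpose can be written. It is only an illustration under stated assumptions, not necessarily the exact code in use: the names warp_transpose32, check_kernel and count_transpose_mismatches are hypothetical, a full 32-thread warp is assumed to be active, and bit 0 is taken as the least significant bit. At each step a thread exchanges its value with the lane whose index differs in one bit, and the pair swaps the off-diagonal blocks of the corresponding 2x2 block decomposition. The check kernel compares the result against 32 __ballot_sync() calls, which is the slow but obvious way to get the same answer.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical helper: on return, bit j of the result in lane k equals
// bit k of the v that was passed in by lane j.
// Assumes all 32 lanes of the warp call this together.
__device__ uint32_t warp_transpose32(uint32_t v)
{
    const unsigned full = 0xFFFFFFFFu;
    const unsigned lane = threadIdx.x & 31u;
    // m selects the bit positions whose index has bit s clear.
    uint32_t m = 0x0000FFFFu;
    #pragma unroll
    for (unsigned s = 16; s != 0; s >>= 1) {
        // Exchange values with the lane whose index differs in bit s.
        const uint32_t other = __shfl_xor_sync(full, v, s);
        if (lane & s) {
            // Keep own upper blocks; pull the partner's upper blocks down.
            v = (v & ~m) | ((other & ~m) >> s);
        } else {
            // Keep own lower blocks; pull the partner's lower blocks up.
            v = (v & m) | ((other & m) << s);
        }
        // 0x0000FFFF -> 0x00FF00FF -> 0x0F0F0F0F -> 0x33333333 -> 0x55555555
        m ^= m << (s >> 1);
    }
    return v;
}

// Cross-check against __ballot_sync(): the ballot of bit k across the warp
// is exactly the word that lane k should end up with.
__global__ void check_kernel(int *mismatches)
{
    const unsigned full = 0xFFFFFFFFu;
    const unsigned lane = threadIdx.x & 31u;
    const uint32_t v = (lane + 1u) * 2654435761u;  // arbitrary test pattern
    const uint32_t got = warp_transpose32(v);
    uint32_t expected = 0;
    for (unsigned k = 0; k < 32; ++k) {
        const uint32_t b = __ballot_sync(full, (v >> k) & 1u);
        if (lane == k) expected = b;
    }
    if (got != expected) atomicAdd(mismatches, 1);
}

// Host-side driver: returns the number of mismatching lanes,
// or -1 if a CUDA call fails.
int count_transpose_mismatches()
{
    int *d = nullptr;
    int h = -1;
    if (cudaMalloc(&d, sizeof(int)) != cudaSuccess) return -1;
    if (cudaMemset(d, 0, sizeof(int)) != cudaSuccess) { cudaFree(d); return -1; }
    check_kernel<<<1, 32>>>(d);
    if (cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess) h = -1;
    cudaFree(d);
    return h;
}
```

Calling count_transpose_mismatches() from main() should return 0 if the shuffle version agrees with the ballot version. The shuffle version costs five shuffles plus a handful of logic ops per thread, versus 32 ballots for the naive version, which is the overhead a dedicated hardware operation would remove.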