I’ve answered exactly that in Method A here before:
https://forums.developer.nvidia.com/t/whats-your-solution-to-get-all-hit-primitives-of-multiple-rays/239528/2
Read the comment on atomic congestion there to find the faster matrix addressing.
I'm not sure exactly which visibility tests you're implementing, but for visibility between triangles or meshes this could be optimized much further. Order the visibility tests (== the rays shot between known scene elements) according to their result location. Then you would only need to set the bit inside some register in device code, and once 32 results are gathered, write them out to the result matrix as one 32-bit word without atomics.
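To make the bit packing concrete, here is a minimal host-side C++ sketch of the idea, not actual OptiX device code. The names `packVisibility` and `isVisible` are made up for illustration, and `isVisible(i)` just stands in for tracing the i-th visibility ray. It assumes the tests are already ordered so that test `i` belongs to bit `i % 32` of word `i / 32`. Each outer loop iteration models what one thread would do.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical helper: packs the results of numTests visibility tests into
// 32-bit words. Test i lands in bit (i % 32) of word (i / 32).
// isVisible(i) is a stand-in for shooting the i-th visibility ray.
std::vector<uint32_t> packVisibility(uint32_t numTests,
                                     const std::function<bool(uint32_t)>& isVisible)
{
    assert(isVisible && "a visibility callback is required");

    std::vector<uint32_t> words((numTests + 31u) / 32u, 0u);

    // One outer iteration models one thread that owns exactly one result word.
    for (uint32_t w = 0; w < words.size(); ++w)
    {
        uint32_t bits = 0u; // accumulator, would live in a register on the device

        for (uint32_t b = 0; b < 32u; ++b)
        {
            const uint32_t i = w * 32u + b;
            if (i >= numTests)
                break; // the last word may be only partially filled
            if (isVisible(i))
                bits |= (1u << b);
        }

        // Plain store: no other thread writes this word, so no atomics are needed.
        words[w] = bits;
    }
    return words;
}
```

Because every word has exactly one writer, the atomic OR per result goes away and the number of memory writes drops by up to a factor of 32.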
Also, if your visibility test is symmetric (A sees B exactly when B sees A), the result matrix is symmetric and you only need to calculate and set the results of the upper triangle.
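As a rough illustration of that saving, the following host-side C++ sketch evaluates only the pairs with `i < j` and mirrors each result. The names `numUpperPairs`, `visibilityMatrixSymmetric` and `isVisible` are hypothetical, and the matrix uses one byte per entry here just to keep the example short. It assumes the diagonal (an element against itself) is not tested and stays 0.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Number of tests needed for n elements when only pairs with i < j are evaluated.
uint64_t numUpperPairs(uint32_t n)
{
    return n < 2u ? 0u : uint64_t(n) * (n - 1u) / 2u;
}

// Hypothetical helper: builds an n x n visibility matrix (row-major, one byte
// per entry) while calling isVisible only for the upper triangle.
std::vector<uint8_t> visibilityMatrixSymmetric(
    uint32_t n, const std::function<bool(uint32_t, uint32_t)>& isVisible)
{
    std::vector<uint8_t> m(std::size_t(n) * n, 0u);

    for (uint32_t i = 0; i < n; ++i)
    {
        for (uint32_t j = i + 1u; j < n; ++j)
        {
            const uint8_t v = isVisible(i, j) ? 1u : 0u;
            m[std::size_t(i) * n + j] = v; // upper triangle: computed
            m[std::size_t(j) * n + i] = v; // lower triangle: mirrored, not re-tested
        }
    }
    return m;
}
```

For n elements this needs n * (n - 1) / 2 tests instead of n * n. If only the upper triangle is ever read back, the mirrored store can be dropped as well.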