What I’m doing is: given a set of meshes, I check which meshes are visible from each mesh.

When there are N meshes, I allocate an N x N array and store whether each pair of meshes is visible.

What I’m curious about is that the entries of my result array are currently 1 byte each.
However, I would like to store them as single bits (e.g., std::bitset).

Not sure which visibility tests you’re implementing exactly, but in the case of visibility between triangles or meshes, this could actually be optimized perfectly by ordering the visibility tests (== the rays shot between known scene elements) according to their result location. Then you would only need to set a bit inside some register in device code and write the results out to the result matrix in 32-bit words, without atomics, once 32 results are gathered.

Also, if your visibility test is symmetric (A sees B exactly when B sees A), the result matrix is symmetric and you only need to calculate and store the results of the upper triangle.
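In the symmetric case there are M = N*(N-1)/2 unique pairs, and each pair (i, j) with i < j can be mapped to a linear result index directly. A small sketch of that mapping (the helper name is illustrative):

```cpp
// Sketch of the symmetric case: with M = n*(n-1)/2 unique pairs,
// map pair (i, j), i < j, to a linear result index in row-major
// upper-triangle order.
long long upperTriangleIndex(long long i, long long j, long long n) {
    // Rows 0..i-1 contribute (n-1) + (n-2) + ... + (n-i) entries
    // before row i starts; within row i, j - i - 1 entries precede j.
    return i * n - i * (i + 1) / 2 + (j - i - 1);
}
```

With this mapping the result buffer only needs ceil(M / 8) bytes instead of N*N.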

If you’re really only determining the visibility between triangles, there is an exact, known number of results, which is a different case from the “which rays hit which primitives” case in the linked post.

You wouldn’t need atomics to write the bit results in that case. Instead you could let each launch index test 8, 16, or 32 of these triangle-to-triangle visibility results in exactly the order of the result bits inside the upper-right matrix.
That means each launch index can calculate exactly which triangle-to-triangle ray connections need to be tested and then write the unsigned byte, unsigned short, or unsigned int containing the result bits to its memory location in the matrix output buffer.
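A host-side sketch of that scheme: each “launch index” t gathers 32 consecutive results into a register-local word and writes that word once, without atomics. The `visible()` callback stands in for the actual ray test and is an assumption of this sketch:

```cpp
#include <cstdint>

// Each launch index t owns the 32 consecutive pair indices
// [t*32, t*32 + 32) and packs their results into one 32-bit word.
// visible() stands in for the actual per-pair ray test.
uint32_t gatherWord(long long t, long long totalPairs,
                    bool (*visible)(long long pairIndex)) {
    uint32_t word = 0;
    for (int b = 0; b < 32; ++b) {
        long long pair = t * 32 + b;
        if (pair >= totalPairs) break;        // last word may be partial
        if (visible(pair)) word |= (1u << b); // set bit locally, no atomics
    }
    return word;                              // caller writes output[t] = word
}
```

In device code the same idea applies: the word lives in a register while the 32 rays are shot, and a single coalesced store per launch index writes it to the output buffer.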

Only if you determined just one visibility result between two triangles per launch index would you need to use atomicOr to write the result into the output buffer.
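For completeness, a host-side analogue of that single-result-per-launch-index case, using std::atomic’s fetch_or in place of the CUDA atomicOr() device function. Multiple threads may target bits in the same 32-bit word, so the read-modify-write must be atomic:

```cpp
#include <atomic>
#include <cstdint>

// Analogue of the one-result-per-launch-index case: several threads can
// hit the same 32-bit word, so OR-ing the bit in must be atomic
// (atomicOr() in CUDA device code, fetch_or() here).
void setVisibleBit(std::atomic<uint32_t>* words, long long bitIndex) {
    words[bitIndex / 32].fetch_or(1u << (bitIndex % 32),
                                  std::memory_order_relaxed);
}
```

That per-bit atomic traffic is exactly what the 32-results-per-launch-index ordering avoids.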