Fast conversion of a vector of integers to a vector of bits in CUDA

I want to do binary matrix multiplication with XOR (bmmaBitOpXOR) and popcount (bmmaAccumulateOpPOPC) using the nvcuda::wmma functionality. The destination fragment, where the result is accumulated with popcount, is an 8x8 matrix of 32-bit integers. I then want to convert the result back to binary/boolean by thresholding. What is the most efficient way to convert this matrix to a 64-bit integer in which each bit is the result of comparing the corresponding matrix element with a threshold value?
After copying the fragment to memory with store_matrix_sync, I would imagine that only one thread should work on each result fragment, because having all threads in a warp synchronize their bit manipulation of the same 64-bit integer seems inefficient.
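
For reference, here is a minimal sketch of the single-thread approach I have in mind. It is only a baseline under several assumptions: the 8x8 tile has already been written to memory in row-major order (in the real kernel by store_matrix_sync), a bit is set when the element is strictly greater than the threshold, and bit 8*row + col corresponds to element (row, col). The names pack_tile and pack_tiles are hypothetical helpers of mine, not part of any library.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical helper: pack an 8x8 row-major tile of 32-bit counts into 64 bits.
// Assumption: bit (8*row + col) is set when tile[8*row + col] > threshold.
__host__ __device__ inline uint64_t pack_tile(const int* tile, int threshold)
{
    uint64_t bits = 0;
    for (int i = 0; i < 64; ++i) {
        // The comparison yields 0 or 1; shift it into position i.
        bits |= static_cast<uint64_t>(tile[i] > threshold) << i;
    }
    return bits;
}

// Hypothetical kernel: one warp per tile, and only lane 0 does the packing.
// Here the tiles are read from memory filled before the launch. In the real
// kernel the tile would come from store_matrix_sync, and I assume a
// __syncwarp() would be needed before lane 0 reads it.
__global__ void pack_tiles(const int* tiles, uint64_t* out,
                           int threshold, int num_tiles)
{
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane = threadIdx.x % 32;
    if (warp < num_tiles && lane == 0) {
        out[warp] = pack_tile(tiles + 64 * warp, threshold);
    }
}
```

With this layout, 31 of the 32 lanes in each warp sit idle while lane 0 performs 64 serial comparisons, which is the inefficiency I would like to avoid.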