XOR a single array with multiple arrays

I want to XOR a single array with a bunch of other arrays (about 100k of them) and count the set bits of every XOR result. Each array is around 10k bits in size.

As a small example, I started with the following:

My array A contains: [1 1 0 0 0 0 1 1 1 1]

My array B contains: [1 1 0 0 1 1 0 1 1 1 1 0 0 1 0 1 1 1 0 1]

I need to perform the XOR in PyCUDA so that A is XORed with the first 10 bits of B, then with the next 10 bits of B, and so on.

So, for the above example, the result should be:

[1 1 0 0 0 0 1 1 1 1] ^ [1 1 0 0 1 1 0 1 1 1] = [0 0 0 0 1 1 1 0 0 0], which has 3 set bits

[1 1 0 0 0 0 1 1 1 1] ^ [1 0 0 1 0 1 1 1 0 1] = [0 1 0 1 0 1 0 0 1 0], which has 4 set bits
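To make the expected output concrete, here is a minimal CPU-side sketch in NumPy. It assumes A is the 10-bit array used in the two XOR lines above and stores one array element per bit, as in the example, rather than packed 32-bit words. The names `a`, `b`, `bunch`, and `counts` are illustrative only.

```python
import numpy as np

# Values taken from the worked example; one array element per bit.
a = np.array([1, 1, 0, 0, 0, 0, 1, 1, 1, 1], dtype=np.uint8)
b = np.array([1, 1, 0, 0, 1, 1, 0, 1, 1, 1,
              1, 0, 0, 1, 0, 1, 1, 1, 0, 1], dtype=np.uint8)

# Split B into rows of len(a) bits, one row per bitset in the bunch.
bunch = b.reshape(-1, a.size)

# Broadcasting XORs `a` against every row; summing each row counts the set bits.
counts = (bunch ^ a).sum(axis=1)

print(counts.tolist())  # [3, 4]
```

This is only a reference for checking the GPU result, not a replacement for the kernel.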

However, my result comes out as [8589934595 0], but the answer should be [3 4].
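A purely arithmetic look at the reported value may help narrow things down: 8589934595 can be split into two 32-bit halves. The sketch below only decomposes the number. The names `observed`, `low`, and `high` are illustrative and do not appear in my code.

```python
observed = 8589934595  # first value of the reported output

low = observed & 0xFFFFFFFF   # lower 32 bits
high = observed >> 32         # upper 32 bits

print(low, high)  # 3 2
```

The lower half equals the first expected count (3), which may suggest that 32-bit values are being read back as 64-bit ones, though I have not confirmed this.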

My code is as follows:

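import numpy
import pycuda.autoinit  # creates the CUDA context
import pycuda.driver as drv
from pycuda.compiler import SourceModule

my_bitset_size = 10
my_bunch_size = 2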

mod_1 = SourceModule("""
__global__ void kernelXOR(uint *bitset, uint *bunch, int *set_bits,
                          int bitset_size, int bunch_size)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid < bunch_size) {  // one thread for each bitset in the 'bunch'
        int sum = 0;
        uint xor_res = 0;
        // Iterate through every uint block of the bitsets
        for (int i = 0; i < bitset_size; ++i) {
            xor_res = bitset[i] ^ bunch[bitset_size * tid + i];
            sum += __popc(xor_res);  // count the set bits in this block
        }
        set_bits[tid] = sum;
    }
}
""")
a = numpy.random.randint(2,size = 10)
b = numpy.random.randint(2,size = 20)
d_r = numpy.zeros((my_bunch_size,), dtype=int)
d_gpu = drv.mem_alloc(d_r.nbytes)
number = int((my_bunch_size / 31) + 32)


xor_function = mod_1.get_function("kernelXOR")
xor_function(
    drv.In(a), drv.In(b), d_gpu,
    numpy.int32(my_bitset_size), numpy.int32(my_bunch_size),
    block=(256, 1, 1), grid=(1, 1))

drv.memcpy_dtoh(d_r, d_gpu)

I have also referred to the question "Cuda: XOR single bitset with array of bitsets" and am trying to implement the same approach, but I am running into issues. Any insights on this would be appreciated.