XOR single array with multiple arrays

I want to XOR a single array with a bunch of other arrays (about 100k of them) and count the set bits of every XOR result. The size of a single array is around 10k bits.

As an example, I started writing out the following:

My array A contains: [1 1 0 0 0 0 1 1 1 1]
My array B contains: [1 1 0 0 1 1 0 1 1 1 1 0 0 1 0 1 1 1 0 1]

So now I need to perform the XOR operation in PyCUDA in such a way that A is XORed with the first 10 bits of array B, then with the next 10 bits, and so on…

So, for the above example, the results will be:

[1 1 0 0 0 0 1 1 1 1] ^ [1 1 0 0 1 1 0 1 1 1] = [0 0 0 0 1 1 1 0 0 0] → 3 (the number of 1s)
[1 1 0 0 0 0 1 1 1 1] ^ [1 0 0 1 0 1 1 1 0 1] = [0 1 0 1 0 1 0 0 1 0] → 4

However, my result comes out as [8589934595 0], when the answer should be [3 4].
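For reference, this minimal NumPy sketch of the same computation on the CPU (one bit per array element, using the example values above) produces the result I expect:

    import numpy

    A = numpy.array([1, 1, 0, 0, 0, 0, 1, 1, 1, 1])
    B = numpy.array([1, 1, 0, 0, 1, 1, 0, 1, 1, 1,
                     1, 0, 0, 1, 0, 1, 1, 1, 0, 1])

    # XOR A against each 10-bit chunk of B and count the 1s per chunk
    chunks = B.reshape(-1, A.size)
    print((A ^ chunks).sum(axis=1))   # prints [3 4]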

My code goes like this:

    import numpy
    import pycuda.autoinit
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    my_bitset_size = 10
    my_bunch_size = 2

    mod_1 = SourceModule("""
    __global__ void kernelXOR(uint *bitset, uint *bunch, int *set_bits, int bitset_size, int bunch_size)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < bunch_size) {                       // 1 thread for each bitset in the 'bunch'
            int sum = 0;
            uint xor_res = 0;
            for (int i = 0; i < bitset_size; ++i) {   // iterate through every uint block of the bitsets
                xor_res = bitset[i] ^ bunch[bitset_size * tid + i];
                sum += __popc(xor_res);
            }
            set_bits[tid] = sum;
        }
    }
    """)

    a = numpy.random.randint(2, size=10)
    b = numpy.random.randint(2, size=20)
    d_r = numpy.zeros((my_bunch_size,), dtype=int)
    d_gpu = drv.mem_alloc(d_r.nbytes)
    number = int((my_bunch_size / 31) + 32)   # (unused)

    xor_function = mod_1.get_function("kernelXOR")
    xor_function(
        drv.In(a), drv.In(b), d_gpu,
        numpy.int32(my_bitset_size), numpy.int32(my_bunch_size),
        block=(256, 1, 1), grid=(1, 1))

    drv.memcpy_dtoh(d_r, d_gpu)
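One thing I am not sure about is the host-side dtypes: numpy.random.randint and dtype=int give me 64-bit integers on my machine, while the kernel parameters are declared as 32-bit uint * and int *. If that mismatch is the problem, a sketch of the allocations with explicit 32-bit types would be:

    a = numpy.random.randint(2, size=10).astype(numpy.uint32)   # match 'uint *bitset'
    b = numpy.random.randint(2, size=20).astype(numpy.uint32)   # match 'uint *bunch'
    d_r = numpy.zeros((my_bunch_size,), dtype=numpy.int32)      # match 'int *set_bits'
    d_gpu = drv.mem_alloc(d_r.nbytes)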

I have also referred to "Cuda: XOR single bitset with array of bitsets" and tried to implement the same approach, but I am facing some issues. It would be great if you could provide some insight into this.
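From that answer I understand the bitsets should be packed into 32-bit words first, so that __popc counts 32 bits per loop iteration instead of one. The helper below is only a sketch of what I had in mind for the host side (pack_bits_to_uint32 is my own name, not from that post); the bit order inside each word should not matter for the counts, as long as both arrays are packed the same way:

    import numpy

    def pack_bits_to_uint32(bits):
        # Pack a 0/1 array into uint32 words, zero-padding up to a multiple of 32 bits
        padded = numpy.zeros(-(-bits.size // 32) * 32, dtype=numpy.uint8)
        padded[:bits.size] = bits
        return numpy.packbits(padded).view(numpy.uint32)

For the 10k-bit arrays this would reduce each bitset from ~10000 elements to ~313 uint32 words.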

Hello, this forum is dedicated to discussions related to using the cuda-memcheck tools.
Questions related to CUDA can be raised at CUDA - NVIDIA Developer Forums.