I want to XOR a single array with a large number of other arrays (about 100k) and count the set bits of each XOR result. A single array is around 10k bits long.
As a small example, I started with the following:
My array A contains: [0 0 1 0 0 1 1 0 1 0]
My array B contains: [1 1 0 0 1 1 0 1 1 1 1 0 0 1 0 1 1 1 0 1]
I need to perform the XOR in PyCUDA so that A is XORed with the first 10 bits of B, then with the next 10 bits of B, and so on.
So, for the above example, the result will be:
[1 1 0 0 0 0 1 1 1 1] [1]=[0 0 0 0 1 1 1 0 0 0]=[3] (the number of 1s)
[1 1 0 0 0 0 1 1 1 1] [2]=[0 1 0 1 0 1 0 0 1 0]=[4]
However, my result comes out as [8589934595 0], but the answer should be [3 4].
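One possible reading of the value 8589934595, offered as a guess and not a confirmed diagnosis: it equals 2 * 2**32 + 3, which is what two 32-bit integers (3 and 2) look like when the same 8 bytes are read back as a single 64-bit integer. That pattern would appear if 32-bit values were written into a buffer that the host treats as 64-bit. The sketch below only demonstrates the arithmetic in NumPy; the values 3 and 2 and the little-endian layout are assumptions.

```python
import numpy as np

# Assumed values: two 32-bit counts stored back to back (little-endian).
counts32 = np.array([3, 2], dtype="<i4")

# Reinterpret the same 8 bytes as one 64-bit integer.
as_int64 = counts32.view("<i8")

print(as_int64)       # [8589934595]
print(2 * 2**32 + 3)  # 8589934595
```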
My code is as follows:
import numpy
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

my_bitset_size = 10
my_bunch_size = 2

mod_1 = SourceModule("""
__global__ void kernelXOR(uint * bitset, uint * bunch, int * set_bits, int bitset_size, int bunch_size) {
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if (tid < bunch_size){ // 1 Thread for each bitset in the 'bunch'
        int sum = 0;
        uint xor_res = 0;
        for (int i = 0; i < bitset_size; ++i){ // Iterate through every uint-block of the bitsets
            xor_res = bitset[i] ^ bunch[bitset_size * tid + i];
            sum += __popc(xor_res);
        }
        set_bits[tid] = sum;
    }
}
""")

# One bit per element: a holds 10 bits, b holds 2 bitsets of 10 bits each
a = numpy.random.randint(2, size=10)
b = numpy.random.randint(2, size=20)
# Host and device buffers for the per-bitset counts
d_r = numpy.zeros((my_bunch_size,), dtype=int)
d_gpu = drv.mem_alloc(d_r.nbytes)
number = int((my_bunch_size / 31) + 32)
xor_function = mod_1.get_function("kernelXOR")
# Launch a single block of 256 threads; only threads with tid < my_bunch_size do work
xor_function(
    drv.In(a), drv.In(b), d_gpu, numpy.int32(my_bitset_size), numpy.int32(my_bunch_size),
    block=(256, 1, 1), grid=(1, 1))
# Copy the counts back to the host
drv.memcpy_dtoh(d_r, d_gpu)
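For comparison, here is a CPU-only sketch of the same computation in NumPy, which could be used to check what the kernel should return for a given a and b. It assumes one bit per array element (values 0 or 1), as in the example, and it creates the arrays with explicit 4-byte dtypes so that each element has the same size as the uint and int the kernel reads and writes. The helper name xor_popcount_reference is hypothetical, and this is not a verified fix for the PyCUDA code.

```python
import numpy as np

def xor_popcount_reference(bitset, bunch):
    """CPU reference: XOR `bitset` with each bitset-sized chunk of `bunch` and count the 1s.

    Assumes one bit per array element (values 0 or 1).
    """
    bitset = np.asarray(bitset, dtype=np.uint32)
    # Split the flat bunch into rows of len(bitset) elements each.
    rows = np.asarray(bunch, dtype=np.uint32).reshape(-1, bitset.size)
    # Broadcasting XORs the single bitset against every row.
    return (rows ^ bitset).sum(axis=1).astype(np.int32)

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=10, dtype=np.uint32)  # 4-byte elements, like uint in the kernel
b = rng.integers(0, 2, size=20, dtype=np.uint32)
d_r = np.zeros(2, dtype=np.int32)                 # 4-byte elements, like int in the kernel

expected = xor_popcount_reference(a, b)
print(a.itemsize, b.itemsize, d_r.itemsize)  # 4 4 4
print(expected)
```

With dtype=int or the default output of numpy.random.randint, NumPy typically uses 8-byte integers, so the element sizes on the host and in the kernel would differ.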
I have also referred to "Cuda: XOR single bitset with array of bitsets" and am trying to implement the same approach, but I am running into issues. Any insight into this problem would be appreciated.
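The kernel comment mentions iterating over "uint-blocks", which suggests the bitsets are meant to be packed 32 bits per uint instead of one bit per element. A minimal sketch of such packing in NumPy follows, assuming zero padding up to a multiple of 32 bits; pack_bits_to_uint32 and popcount_words are hypothetical helpers, not part of PyCUDA or the referenced question.

```python
import numpy as np

def pack_bits_to_uint32(bits):
    """Pack a 1-D array of 0/1 values into uint32 words, zero-padded to a multiple of 32 bits."""
    bits = np.asarray(bits, dtype=np.uint8)
    padded_len = -(-bits.size // 32) * 32  # round up to a multiple of 32
    padded = np.zeros(padded_len, dtype=np.uint8)
    padded[:bits.size] = bits
    # packbits yields bytes; every 4 consecutive bytes are viewed as one uint32 word.
    return np.packbits(padded).view(np.uint32)

def popcount_words(words):
    """Count the set bits across an array of uint32 words."""
    return int(np.unpackbits(words.view(np.uint8)).sum())

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=10_000)  # roughly the bitset size mentioned in the question
y = rng.integers(0, 2, size=10_000)
packed_x, packed_y = pack_bits_to_uint32(x), pack_bits_to_uint32(y)

print(packed_x.size)  # 313 words for 10,000 bits
print(popcount_words(packed_x ^ packed_y) == int((x ^ y).sum()))  # True
```

With this layout, the bitset_size passed to the kernel would be the number of uint words per bitset (313 here), not the number of bits.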