multiple xor

Hi everyone,

I am a beginner in CUDA programming. Here is the problem statement. We have two arrays. The first (call it arr1) is an array of 0’s and 1’s. Its size is 150, and it contains 10 binary numbers, each 15 bits long.

The second (call it arr2) is also an array of 0’s and 1’s. Its size is 310, and it contains 10 binary numbers, each 31 bits long.

What we want to do is fill two arrays (call them xor1 and xor2). The size of xor1 will be 135 (9 × 15), and the size of xor2 will be 279 (9 × 31). xor1 and xor2 are filled according to the following C++ code:

for(int i = 0; i < 9; i++)
{
    for(int j = 0; j < 15; j++)
    {
        xor1[i * 15 + j] = arr1[i * 15 + j] ^ arr1[(i+1) * 15 + j];
    }  
    for(int j = 0; j < 31; j++)
    {
        xor2[i * 31 + j] = arr2[i * 31 + j] ^ arr2[(i+1) * 31 + j];
    }  
}

What is an efficient way to implement this in CUDA? Is there any way to calculate xor1 and xor2 at the same time?

Thank you in advance for your time.

“Here is the problem statement”

it is rather refreshing to see a properly stated problem

“In each group of 10 numbers”

let’s call it a subset
the subsets of xor are independent of each other, since they seemingly depend only on arr
hence, you could easily assign one or more thread blocks to each subset of xor and calculate the subsets concurrently

i suppose a single thread, rather than multiple threads, would calculate a single element of each subset
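
As a rough sketch of that one-thread-per-element idea, the C++ below puts the per-thread work in a function and lets a host loop stand in for the grid. It assumes the arrays are stored as one int per bit, exactly as in the question. The names xor_element and xor_all are made up for illustration; in an actual kernel tid would come from blockIdx.x * blockDim.x + threadIdx.x. Treating xor1 and xor2 as one combined index range of 135 + 279 elements is one possible way to compute both in a single launch, not the only one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kNumbers = 10;                    // binary numbers per array (from the question)
constexpr int kLen1 = 15;                       // bits per number in arr1
constexpr int kLen2 = 31;                       // bits per number in arr2
constexpr int kOut1 = (kNumbers - 1) * kLen1;   // 135 elements in xor1
constexpr int kOut2 = (kNumbers - 1) * kLen2;   // 279 elements in xor2

// Work of one logical thread (hypothetical helper). Because the numbers are
// stored back to back, element idx of xor1 is arr1[idx] ^ arr1[idx + 15],
// and likewise for xor2 with a stride of 31.
inline void xor_element(int tid, const int* arr1, const int* arr2,
                        int* xor1, int* xor2) {
    if (tid < kOut1) {
        xor1[tid] = arr1[tid] ^ arr1[tid + kLen1];
    } else if (tid < kOut1 + kOut2) {
        const int k = tid - kOut1;              // index into xor2
        xor2[k] = arr2[k] ^ arr2[k + kLen2];
    }                                           // surplus threads do nothing
}

// Host stand-in for a kernel launch: one loop iteration per thread.
inline void xor_all(const std::vector<int>& arr1, const std::vector<int>& arr2,
                    std::vector<int>& xor1, std::vector<int>& xor2) {
    assert(arr1.size() == static_cast<std::size_t>(kNumbers * kLen1));
    assert(arr2.size() == static_cast<std::size_t>(kNumbers * kLen2));
    xor1.assign(kOut1, 0);
    xor2.assign(kOut2, 0);
    for (int tid = 0; tid < kOut1 + kOut2; ++tid) {
        xor_element(tid, arr1.data(), arr2.data(), xor1.data(), xor2.data());
    }
}
```

With only 414 output elements in total, a single thread block would already cover the whole problem, so the launch overhead will likely dominate the arithmetic.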

i am not sure how you intend to transfer the data to the device
you are using a ‘custom’ data type, given that its length is not really standard
either the host must chop up and pre-package the data into a format the device can readily work with, or the device must be aware that it needs to pre-process the incoming data first
otherwise the threads of a thread block would have a hard time accessing the individual data they are supposed to work on
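
One possible way for the host to pre-package the data, sketched below in C++: since 15 and 31 bits both fit in a 32-bit word, each binary number can be packed into a single std::uint32_t, after which XORing two neighbouring numbers is one word-wide operation instead of 15 or 31 separate ones. The helper names pack_bits, xor_adjacent and bit_at are hypothetical, and storing bit j of a number at bit position j of the word is an arbitrary choice made here, not something the question specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack `count` numbers of `bits` 0/1 entries each into one 32-bit word per
// number. Assumption: entry j of a number goes to bit position j of its word.
inline std::vector<std::uint32_t> pack_bits(const std::vector<int>& arr,
                                            int count, int bits) {
    assert(bits >= 1 && bits <= 32);
    assert(arr.size() == static_cast<std::size_t>(count) * bits);
    std::vector<std::uint32_t> words(count, 0u);
    for (int i = 0; i < count; ++i) {
        for (int j = 0; j < bits; ++j) {
            words[i] |= static_cast<std::uint32_t>(arr[i * bits + j] & 1) << j;
        }
    }
    return words;
}

// XOR each packed number with its successor: out[i] = words[i] ^ words[i + 1].
// On the device, each i could be handled by one thread.
inline std::vector<std::uint32_t> xor_adjacent(
        const std::vector<std::uint32_t>& words) {
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i + 1 < words.size(); ++i) {
        out.push_back(words[i] ^ words[i + 1]);
    }
    return out;
}

// Read entry j back out of a packed word.
inline int bit_at(std::uint32_t word, int j) {
    return static_cast<int>((word >> j) & 1u);
}
```

Packed this way, arr1 and arr2 shrink to 10 words each and xor1 and xor2 to 9 words each, so the transfer to the device is small and every thread reads whole, aligned words.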