Can this be parallelized?

I have two arrays and I want to find the same guys in both of it, so can this be parallelized? any idea?

Thanks :)

What exactly do you want to do? OK, you have two arrays. Do you want to find all the values that are in common between the two? Do you want to only answer the question if there is some value shared? Do you want to know the number of values shared? Do you need to know their locations? Etc. I have little doubt each can be solved in parallel but some will be easier than others.

Sorry for that, I want to find the values :)

You can sort one array, then binary search each element of the other from it. If the range is small enough, you can do histograming instead.
However, the bsearch approach is O((n+m)logn), and it’s relatively easy to do the problem in O(n+m) in serial with radix sort+merge sort. Such a naive parallelization isn’t likely to be efficient.

How large are the arrays? Will they always be the same size between invocations? Are the two arrays being compared the same size?

It is already sorted, by the way, different size…
I am wondering if it can be done parallelization…

Also to asadafag:
When I am checking the prefix-scan, it seems that the copy part, say, from global to share and from share to global, takes most time. So can I compress the data and then copy it from global to share? That means you reduce the data you transfer and add the computation workload. Will that be faster?

thanks, guys! :)

That’s possible if the compression isn’t costly. I’m planning to test a short-to-int scan some time. However, one disadvantage is that in-place scan is infeasible for compressed data, and may consume more memory in total.

No one has any idea? I try the binary search and extract the same ones from a 1M array and 2M array, it takes 16ms…

What’s the range of your values? Is there any special property?
If you can afford a costly precompute, you can build a hash of A on CPU or using sm11’s atomic operation. A hash look up may be better than a bsearch even with completely random access pattern… If you just use a modulo hash, the memory access pattern would even be coherent.

Help!!! I tried 3 days in debugging this little thing… <img src=‘http://hqnveipbwb20/public/style_emoticons/<#EMO_DIR#>/crying.gif’ class=‘bbc_emoticon’ alt=’:’(’ />

I use binary search to find the same things between two sorted arrays, 1M and 16M separately.

I first divide the 1M array into groups with 2K elements each group, then use the binary search to find the position of the last element in each group. Below is the stupid code :

   int tIDx = threadIdx.x;
   int bIDx = blockIdx.x;
   int BlockOff=bIDx<<8;     // 256 thread, each deals with one element
   int address0 = tIDx + BlockOff;
   int temp=d_Array[address0]; // d_Array stores the last elements of each group       
   int high;
   int low;
   unsigned int middle;
   high=HIGH;      // HIGH = 16M here
   low=LOW;        // LOW = 0 here 
       for(high=HIGH,low=LOW; low<high; )   // simple binary search here
       {                                                 // the low stops in the position which               equals to or greater than temp
           middle= (low+high)>>1;                   
           if (temp1>d_SourceArray[middle])
              low = middle + 1; 
           else 
              high = middle;              
       }                           
       d_GPos[address0]=low;             // store the position in d_GPos

The idea is simple and the code is also simple, but when I check the d_GPos array, the result shows only the first 128 threads work properly… I use 256 threads each block and it seems funny…

Help guys, 3 days in debugging this…

Thanks ^_^

Someone can help? I am very very very thankful …

Why are you using both “temp” and “temp1”?

I am sorry, it is:
int tIDx = threadIdx.x;
int bIDx = blockIdx.x;
int BlockOff=bIDx<<8; // 256 thread, each deals with one element
int address0 = tIDx + BlockOff;
int temp=d_Array[address0]; // d_Array stores the last elements of each group
int high;
int low;
unsigned int middle;
high=HIGH; // HIGH = 16M here
low=LOW; // LOW = 0 here
for(high=HIGH,low=LOW; low<high; ) // simple binary search here
{ // the low stops in the position which equals to or greater than temp
middle= (low+high)>>1;
if (temp>d_SourceArray[middle])
low = middle + 1;
else
high = middle;
}
d_GPos[address0]=low; // store the position in d_GPos

It seems only the first 128 threads work… I can not figure out why :(

Could you please also post how you allocate the memory and how you call the global function?

cudaMalloc((void**) &d_SourceArray, 16M);

cudaMalloc((void**) &d_Array, 0.5K);

cudaMalloc((void**) &d_GPos, 0.5K);

when I call the function, I use:

bsearch<<<16M/256, 256>>>(d_Array,d_SourceArray,d_GPos);

This is really wierd :(

I’m a little confused…

You are using a 512 elements array d_Array, right? But you have 256 threads. And why are you using 16M/256 blocks?

Oh sorry, the call function should be

bsearch<<<0.5K/256, 256>>>(d_Array,d_SourceArray,d_GPos);

I type the wrong thing, the problem is still there…

Still that question:

You are using a 512 elements array d_Array, right? But you have 256 threads. Thus you can only process half of the array?

Yes, but each block process 256 numbers and I have two blocks…

According to these 2 lines:

int BlockOff=bIDx<<8; // 256 thread, each deals with one element

int address0 = tIDx + BlockOff;

It seems you are accessing element 0~255 for first block, then 1~256 for the second block. right? Is this the reason for the problem?