Can this be parallelized?

dingshuai1985 · October 18, 2007, 8:21pm

I have two arrays and I want to find the elements that appear in both of them. Can this be parallelized? Any ideas?

Thanks :)

wildcat4096 · October 19, 2007, 2:32am

What exactly do you want to do? OK, you have two arrays. Do you want to find all the values the two have in common? Do you only want to know whether any value is shared at all? Do you want the number of shared values? Do you need their locations? Etc. I have little doubt each of these can be solved in parallel, but some will be easier than others.

dingshuai1985 · October 19, 2007, 3:12am

Sorry about that, I want to find the values themselves :)

asadafag · October 19, 2007, 3:17am

You can sort one array, then binary search it for each element of the other. If the range of values is small enough, you can use histogramming instead.
However, the bsearch approach is O((n+m)log n), and it’s relatively easy to solve the problem in O(n+m) serially with a radix sort followed by a merge pass. Such a naive parallelization isn’t likely to be efficient.
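
The binary-search approach can be sketched serially as below. This is a minimal host-side C++ sketch, not code from the thread: the name `intersect_bsearch` is hypothetical, and it assumes `a` is already sorted. Each lookup is independent of the others, which is what would let one GPU thread handle one element of `b`.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: returns the elements of b that also occur in a,
// in the order they appear in b.
// Assumes a is sorted in ascending order; b may be in any order.
std::vector<int> intersect_bsearch(const std::vector<int>& a,
                                   const std::vector<int>& b) {
    assert(std::is_sorted(a.begin(), a.end()));
    std::vector<int> common;
    for (int value : b) {
        // O(log n) lookup; iterations do not depend on each other.
        if (std::binary_search(a.begin(), a.end(), value)) {
            common.push_back(value);
        }
    }
    return common;
}
```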

wildcat4096 · October 19, 2007, 3:20am

How large are the arrays? Will they always be the same size between invocations? Are the two arrays being compared the same size?

dingshuai1985 · October 19, 2007, 3:26am

They are already sorted, by the way, and of different sizes…
I am wondering whether it can be parallelized…

Also to asadafag:
When I looked at the prefix scan, it seemed that the copies, from global to shared memory and from shared back to global, take most of the time. So could I compress the data before copying it from global to shared? That would reduce the amount of data transferred at the cost of extra computation. Would that be faster?

thanks, guys! :)
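
Because both arrays are already sorted, the serial O(n+m) baseline mentioned earlier needs no sorting step at all: one merge-style pass suffices. The sketch below is a host-side C++ illustration under that assumption; `intersect_sorted` is a hypothetical name, not something from the thread. The standard library's `std::set_intersection` does the same job.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: values common to two ascending-sorted arrays.
// A value repeated in both inputs is reported min(count_a, count_b) times.
std::vector<int> intersect_sorted(const std::vector<int>& a,
                                  const std::vector<int>& b) {
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));
    std::vector<int> common;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) {
            ++i;                      // a[i] cannot match anything later in b
        } else if (b[j] < a[i]) {
            ++j;                      // b[j] cannot match anything later in a
        } else {
            common.push_back(a[i]);   // equal: record and advance both
            ++i;
            ++j;
        }
    }
    return common;
}
```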

asadafag · October 20, 2007, 3:09am

That’s possible if the compression isn’t costly. I’m planning to test a short-to-int scan some time. However, one disadvantage is that an in-place scan is infeasible for compressed data, so it may consume more memory in total.
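
One possible reading of a "short-to-int scan" is an exclusive prefix sum that reads 16-bit inputs and writes 32-bit outputs, halving the bytes read. The serial C++ sketch below illustrates only that idea and is an assumption about what was meant; the name `exclusive_scan_short_to_int` is hypothetical. It also shows why such a scan cannot run in place: each output element is wider than the input element it would overwrite.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: exclusive prefix sum with 16-bit input and
// 32-bit output. out[i] is the sum of in[0..i-1]; out[0] is 0.
std::vector<std::int32_t> exclusive_scan_short_to_int(
        const std::vector<std::int16_t>& in) {
    std::vector<std::int32_t> out(in.size());  // separate, wider buffer
    std::int32_t running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;
        running += in[i];   // accumulate in 32 bits; sums may exceed 16 bits
    }
    return out;
}
```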

dingshuai1985 · October 22, 2007, 2:11am

No one has any ideas? I tried binary search to extract the common values from a 1M array and a 2M array; it takes 16 ms…

asadafag · October 22, 2007, 2:42am

What’s the range of your values? Is there any special property?
If you can afford a costly precompute, you can build a hash table of A on the CPU or using sm_11’s atomic operations. A hash lookup may be better than a bsearch even with a completely random access pattern… If you just use a modulo hash, the memory access pattern would even be coherent.
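
A minimal host-side sketch of the hash idea, assuming an open-addressing table with a modulo hash and linear probing. None of this comes from the thread: `HashSet`, `EMPTY`, `intersect_hash`, and the table sizing are illustrative choices, and the sketch assumes the sentinel value never occurs in the data. On the GPU the build step would need atomic operations, as noted above, since several threads may try to claim the same slot.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical sentinel marking an unused slot; assumed absent from the data.
const int EMPTY = INT_MIN;

struct HashSet {
    std::vector<int> slots;

    // Capacity must exceed the number of distinct keys so that at least
    // one slot stays empty and probing always terminates.
    explicit HashSet(std::size_t capacity) : slots(capacity, EMPTY) {
        assert(capacity > 0);
    }

    std::size_t home(int key) const {
        // Modulo hash; the unsigned cast keeps negative keys in range.
        return static_cast<unsigned int>(key) % slots.size();
    }

    void insert(int key) {
        assert(key != EMPTY);
        std::size_t i = home(key);
        while (slots[i] != EMPTY && slots[i] != key) {
            i = (i + 1) % slots.size();   // linear probing
        }
        slots[i] = key;
    }

    bool contains(int key) const {
        std::size_t i = home(key);
        while (slots[i] != EMPTY) {
            if (slots[i] == key) return true;
            i = (i + 1) % slots.size();
        }
        return false;
    }
};

// Elements of b that also occur in a, in the order they appear in b.
std::vector<int> intersect_hash(const std::vector<int>& a,
                                const std::vector<int>& b) {
    HashSet table(2 * a.size() + 1);      // roughly 50% load factor
    for (int key : a) table.insert(key);
    std::vector<int> common;
    for (int key : b) {
        if (table.contains(key)) common.push_back(key);
    }
    return common;
}
```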

dingshuai1985 · November 2, 2007, 8:37pm

Help!!! I have spent 3 days debugging this little thing… :'(

I use binary search to find the common values between two sorted arrays, of 1M and 16M elements respectively.

I first divide the 1M array into groups of 2K elements each, then use binary search to find the position of the last element of each group in the 16M array. Below is the stupid code:

   int tIDx = threadIdx.x;
   int bIDx = blockIdx.x;
   int BlockOff = bIDx << 8;       // 256 threads, each deals with one element
   int address0 = tIDx + BlockOff;
   int temp = d_Array[address0];   // d_Array stores the last element of each group
   int high;
   int low;
   unsigned int middle;
   high = HIGH;                    // HIGH = 16M here
   low = LOW;                      // LOW = 0 here
   // simple binary search: low stops at the first position whose value
   // is equal to or greater than temp
   for (high = HIGH, low = LOW; low < high; )
   {
       middle = (low + high) >> 1;
       if (temp1 > d_SourceArray[middle])
           low = middle + 1;
       else
           high = middle;
   }
   d_GPos[address0] = low;         // store the position in d_GPos

The idea is simple and the code is also simple, but when I check the d_GPos array, the results show that only the first 128 threads work properly… I use 256 threads per block, so this seems strange…

Help, guys, I have been debugging this for 3 days…

Thanks ^_^

dingshuai1985 · November 3, 2007, 4:29pm

Can someone help? I would be very, very thankful…

trudger · November 4, 2007, 2:20pm

Why are you using both “temp” and “temp1”?

dingshuai1985 · November 4, 2007, 2:43pm

I am sorry, it should be:
   int tIDx = threadIdx.x;
   int bIDx = blockIdx.x;
   int BlockOff = bIDx << 8;       // 256 threads, each deals with one element
   int address0 = tIDx + BlockOff;
   int temp = d_Array[address0];   // d_Array stores the last element of each group
   int high;
   int low;
   unsigned int middle;
   high = HIGH;                    // HIGH = 16M here
   low = LOW;                      // LOW = 0 here
   // simple binary search: low stops at the first position whose value
   // is equal to or greater than temp
   for (high = HIGH, low = LOW; low < high; )
   {
       middle = (low + high) >> 1;
       if (temp > d_SourceArray[middle])
           low = middle + 1;
       else
           high = middle;
   }
   d_GPos[address0] = low;         // store the position in d_GPos

It seems only the first 128 threads work… I cannot figure out why :(
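
To check `d_GPos` after copying it back, the same search can be run on the host. The sketch below mirrors the loop above in plain C++; `lower_bound_index` and `reference_positions` are hypothetical names, and the vectors stand in for host copies of `d_SourceArray` and `d_Array`. The result should agree with `std::lower_bound` on the same data.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Same loop as the kernel body: first index in [0, n) whose value is
// >= key, or n if every value is smaller. source must be sorted ascending.
int lower_bound_index(const std::vector<int>& source, int key) {
    int low = 0;
    int high = static_cast<int>(source.size());
    while (low < high) {
        int middle = low + (high - low) / 2;  // avoids overflow of low + high
        if (key > source[middle])
            low = middle + 1;
        else
            high = middle;
    }
    return low;
}

// One position per key, as the kernel is meant to write into d_GPos.
std::vector<int> reference_positions(const std::vector<int>& source,
                                     const std::vector<int>& keys) {
    assert(std::is_sorted(source.begin(), source.end()));
    std::vector<int> positions;
    positions.reserve(keys.size());
    for (int key : keys) {
        positions.push_back(lower_bound_index(source, key));
    }
    return positions;
}
```

Comparing the copied-back `d_GPos` against `reference_positions` element by element would show exactly which indices differ, rather than only that some do.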

trudger · November 4, 2007, 5:59pm

Could you please also post how you allocate the memory and how you call the global function?

dingshuai1985 · November 4, 2007, 8:16pm

cudaMalloc((void**) &d_SourceArray, 16M);

cudaMalloc((void**) &d_Array, 0.5K);

cudaMalloc((void**) &d_GPos, 0.5K);

When I call the function, I use:

bsearch<<<16M/256, 256>>>(d_Array,d_SourceArray,d_GPos);

This is really weird :(
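
One thing worth ruling out, assuming the sizes above are passed to `cudaMalloc` literally: its second argument is a size in bytes, not an element count. A 0.5K-byte buffer holds only 128 four-byte `int` values, which would be consistent with only the first 128 threads producing valid results. The host-side arithmetic can be sketched as follows; `bytes_for_ints` and `ints_that_fit` are hypothetical helper names, not part of the CUDA API.

```cpp
#include <cstddef>

// Bytes needed to hold `count` int elements; this is the kind of value
// a byte-sized allocation call such as cudaMalloc expects.
std::size_t bytes_for_ints(std::size_t count) {
    return count * sizeof(int);
}

// Number of whole int elements that fit in a buffer of `bytes` bytes.
std::size_t ints_that_fit(std::size_t bytes) {
    return bytes / sizeof(int);
}
```

With a 4-byte `int`, `ints_that_fit(512)` is 128, while 512 elements need `bytes_for_ints(512)`, that is 2048 bytes.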

trudger · November 4, 2007, 10:20pm

I’m a little confused…

You are using a 512-element array d_Array, right? But you have 256 threads. And why are you using 16M/256 blocks?

dingshuai1985 · November 5, 2007, 3:45am

Oh sorry, the kernel call should be

bsearch<<<0.5K/256, 256>>>(d_Array,d_SourceArray,d_GPos);

I typed the wrong thing; the problem is still there…

trudger · November 5, 2007, 1:55pm

Still the same question:

You are using a 512-element array d_Array, right? But you have 256 threads. So you can only process half of the array?

dingshuai1985 · November 5, 2007, 3:42pm

Yes, but each block processes 256 numbers and I have two blocks…

trudger · November 5, 2007, 3:51pm

According to these 2 lines:

int BlockOff=bIDx<<8; // 256 thread, each deals with one element

int address0 = tIDx + BlockOff;

It seems you are accessing elements 0~255 in the first block, then 1~256 in the second block, right? Is this the reason for the problem?
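
The index arithmetic in those two lines can be checked on the host. In the sketch below, `global_index` is a hypothetical stand-in for the kernel's `address0` computation, with the block and thread indices passed as plain integers, and `blocks_needed` is an illustrative helper for the grid size. Because `bIDx<<8` equals `bIDx*256`, block 0 covers elements 0 to 255 and block 1 covers 256 to 511, so two blocks of 256 threads do not overlap.

```cpp
// Mirrors the kernel: address0 = threadIdx.x + (blockIdx.x << 8).
int global_index(int block, int thread) {
    int block_offset = block << 8;   // same as block * 256
    return thread + block_offset;
}

// Number of blocks needed to cover `count` elements, rounding up.
int blocks_needed(int count, int threads_per_block) {
    return (count + threads_per_block - 1) / threads_per_block;
}
```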