optimization of a special double loop

Hi everybody,

First of all, thank you for all the information you provide in these forums, and sorry for my english too ;-)

My problem is simple. I need to scan two vector (sorted) and check if the sum of every index is equal to the value i’m looking for. Something like this:

int k = 0:tam1+tam2;

for (i=0; i<tam1; i++)

   if (i < k) {

   for (j=0, j<tam2; j++)

	  if ( i + j = k ) [store the result]

	  else if ( i + j > k) break;		  // vectors are sorted <=

   else k++;

The question is how to do this efficiently. The size of my vectors suppose that is 1024, so for example I could divide 2048/64, each one of the 64 threads launched to cover the whole vector scan their part th0: 0 to 32, th1: 33 to 64, etc so treadIdx.x would be the k variable. Each kernel do a double loop to find the values. And also, I want to do this with several points, so and apropiate grid could be (1, num_points) and the block (64). Something like I show next ( simplified ).

__global__ void scan( params ) {

	int k = threadIdx.x*size_part;			

	int point = blockIdx.y;				

	int id_th = threadIdx.x;


		int i1 = 0;	

	while ( !end ) {

		if ( vector1[i1 + point*tam] < k ) {			  // the values por every point are split tam by tam - it's the jump

			int i2 = abs(k - vector1[i1  + point*tam]);

						int i3 = i2;

			while ( !find ) {

				if (vector2[i3  + point*tam] == i2)	{ [store result]; 	find = true; }

				else if (vector2[i3  + point*tam] < i2 ) { break; i1++; }





		else { i1 = 0;  k = k + 1; }

		end = (k == (id_th + 1)*size_part);


I have done this but it still slow, and I think there must be something more efficient to do this. If you understand me and can help me, I would be very glad.

Thank you.

You should look in the SDK for reduction and scan samples.