Sum of a subvector

Hi everybody,

I want to do a sum over some elements of a vector (power of two an sorted previously), and write this sum in other vector. The number of elements to sum in each case it’s not known but there are values that match a condition (are pairs index-value). Let me put a simple example:

in vector have pairs (index, values) and contain this information for several points (each of size 4 in this example). So I need to compute the sum of each subvector for every point if the index _i math. Subvector are sorted by index using bitonic sort very fast.

in vector: [ (1, 0.1) (1, 0.5) (1, 0.2) (3, 0.2) - (1, 0.2) (1, 0.3) (3, 0.2) (3, 0.2) ]
out vector: [ (1, 0.8) (3, 0.3) (INT_MAX,0) (INT_MAX, 0) - (1, 0.5) (3, 0.4) (INT_MAX,0) (INT_MAX, 0) ] // same length

Now, I’m launching as many threads as points I have (in the example 2 threads) that look for the index and write the result, but I want to make this more efficiently. Do you have some ideas to put in practice?

Thanks in advance

Look in the scan and reduction samples in the SDK.

Its amazing how common this thing is and how every new CUDA user asks the same question in

the forums (including myself :) a year ago ).

Maybe nVidia should put a sticky post saying - look at reduction and scan samples in the SDK ;)

eyal

thank you very much, I’ll look at the examples and sorry for that but I’m new :unsure:

Nothing to be sorry at - I guess most people here gone that path :)

Enjoy CUDA ! :)

eyal

I’ve looking at the examples, but I don’t uderstand some thing and how to adapt it to my particular problem…if you can help I would be very glad.

The question is, suppose I take this code from reduction example

__global__ void reduction(float *g_idata, float *g_odata)

{

	SharedMemory< float  > smem;

	float *sdata = smem.getPointer();

	// load shared mem

	unsigned int tid = threadIdx.x;

	unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

	sdata[tid] = g_idata[i];

	__syncthreads();

	// do reduction in shared mem

	for(unsigned int s=blockDim.x/2; s>0; s>>=1) 

	{

		if (tid < s) 

		{

			sdata[tid] += sdata[tid + s];

		}

		__syncthreads();

	}

	// write result for this block to global mem

	if (tid == 0) g_odata[blockIdx.x] = sdata[0];

}

it’s obviously that it computes the sum of all values for each index isn’t it? But I want to sum only those values that meet the condition, I mean, I work with pairs (float2) so I want to sum A and B if A.x == B.x, and in this case store the result A.y + B.y in global memory. How can I adapt these). Something like this but it didn’t work:

__global__ void reduction_mine(float2 *g_idata, float2 *g_odata)

{

	SharedMemory< float2  > smem;

	float2 *sdata = smem.getPointer();

	// load shared mem

	unsigned int tid = threadIdx.x;

	unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

	sdata[tid] = g_idata[i];

	__syncthreads();

	// do reduction in shared mem

	for(unsigned int s=blockDim.x/2; s>0; s>>=1) 

	{

		if (tid < s) 

		{

			if (sdata[tid].x == sdata[tid + s].x) 

			sdata[tid] += sdata[tid + s];

		}

		__syncthreads();

	}

	// write result for this block to global mem

	if (tid == 0) g_odata[blockIdx.x] = sdata[0];

}

I’ve looking at the examples, but I don’t uderstand some thing and how to adapt it to my particular problem…if you can help I would be very glad.

The question is, suppose I take this code from reduction example

__global__ void reduction(float *g_idata, float *g_odata)

{

	SharedMemory< float  > smem;

	float *sdata = smem.getPointer();

	// load shared mem

	unsigned int tid = threadIdx.x;

	unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

	sdata[tid] = g_idata[i];

	__syncthreads();

	// do reduction in shared mem

	for(unsigned int s=blockDim.x/2; s>0; s>>=1) 

	{

		if (tid < s) 

		{

			sdata[tid] += sdata[tid + s];

		}

		__syncthreads();

	}

	// write result for this block to global mem

	if (tid == 0) g_odata[blockIdx.x] = sdata[0];

}

it’s obviously that it computes the sum of all values for each index isn’t it? But I want to sum only those values that meet the condition, I mean, I work with pairs (float2) so I want to sum A and B if A.x == B.x, and in this case store the result A.y + B.y in global memory. How can I adapt these). Something like this but it didn’t work:

__global__ void reduction_mine(<b>float2 </b>*g_idata, <b>float2 </b>*g_odata)

{

	SharedMemory< <b>float2  </b>> smem;

	<b>float2 </b>*sdata = smem.getPointer();

	// load shared mem

	unsigned int tid = threadIdx.x;

	unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

	sdata[tid] = g_idata[i];

	__syncthreads();

	// do reduction in shared mem

	for(unsigned int s=blockDim.x/2; s>0; s>>=1) 

	{

		if (tid < s) 

		{

			<b>if (sdata[tid].x == sdata[tid + s].x) </b>

					 sdata[tid] += sdata[tid + s];

		}

		__syncthreads();

	}

	// write result for this block to global mem

	if (tid == 0) g_odata[blockIdx.x] = sdata[0];

}

Thanks you in advance!

Are you sure you’re using shared memory?

Instead of :

if (sdata[tid].x == sdata[tid + s].x)

   sdata[tid] += sdata[tid + s];

try to do this:

if (sdata[tid].x == sdata[tid + s].x)

   sdata[tid] += sdata[tid + s];

else

  sdata[tid] = 0;  // I guess you should zero it since you dont want it to go into the summation.

what do you think?

yes, I think you’re right, although the sum is in .y…

if (sdata[tid].x == sdata[tid + s].x)

   sdata[tid].y += sdata[tid + s].y;

else

  sdata[tid].y = 0;  // I guess you should zero it since you dont want it to go into the summation.

So, this kernel computes the sum of requires values for one position of the output vector (blockIdx.x), and I need to launch as many blocks as number of elements in my vector? or as many threads as number elements?

thank you