Hello MissSusan,

I tried to compute the sums of rows in a matrix and I think this problem is smiliar to yours.

My first try to realise your idea, would be to take a kernel configuration, which helps to avoid the search of your ids.

A basic approach is to use the register blockIdx.y as ID for a reduction ( if your IDs are less than 65536 ) to sum all elements with the accordingly ID.

That way, you can sum up 65536 x 512 Elements with every blockIdx.y row.

This is the host code an a kernel configuration, i would use…

```
...
// n is the number of elements
int n = ...
dim3 sum_DimBlock, sum_DimGrid;
// maxThreads is the maximum number of threads you want to use or your device can deliver
// we only need n / 2 threads in the first reduction circle to sum n elements
sum_DimBlock.x = ( n < maxThreads << 1 ) ? (int)ceil( n / 2.0 ) : maxThreads;
sum_DimGrid.x = (int)ceil( n / (float)( sum_DimBlock.x << 1 ));
sum_DimGrid.y = maximumID;
sums<<< sum_DimGrid, sum_DimBlock, sizeof( float ) * sum_DimBlock.x>>>( n );
while( sum_DimGrid.x > 1 )
{
n = sum_DimGrid.x;
sum_DimBlock.x = ( n < maxThreads << 1 ) ? (int)ceil( n / 2.0 ) : maxThreads;
sum_DimGrid.x = (int)ceil( n / (float)( sum_DimBlock.x << 1 ));
sums<<< sum_DimGrid, sum_DimBlock, sizeof( float ) * sum_DimBlock.x >>>( n );
}
...
```

The sums kernel can be something like:

```
__global__ void sums( int n )
{
// array in shared memory, use kernel configuration to specify the used memory
extern __shared__ float T_s[];
unsigned int ix = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int ixb = ix + gridDim.x * blockDim.x;
// load id from global memory to a register to avoid multiple global mem reads
// following ..._cPtr arrays are pointers in constant memory ( you don't need to send it with every kernel call ) that point to an array in global memory
float actualId_r = id_array_cPtr[ ix ];
unsigned int actualIndex_r = index_array_cPtr[ ix ];
float secondId_r = ( ixb < n ? id_array_cPtr[ ixb ] : -1 );
unsigned int secondIndex_r = ( ixb < n ? index_array_cPtr[ ixb ] : 0 );
// check, if the elements have the right ID
T_s[ threadIdx.x ] = ( actualId_r == blockIdx.y ? actualIndex_r : 0 ) + ( secondId_r == blockIdx.y ? secondIndex_r : 0 )
__syncthreads();
// sum all elements within the block
unsigned int s_old = blockDim.x;
for( int s = blockDim.x >> 1; s > 0; s >>= 1 )
{
if( threadIdx.x < s )
{
T_s[ threadIdx.x ] += T_s[ threadIdx.x + s ];
if( threadIdx.x == 0 && s_old != s << 1 ) // if n is not even, take the last element
{
T_s[ 0 ] += T_s[ s_old - 1 ];
}
}
s_old = s;
__syncthreads();
}
if( threadIdx.x == 0 )
{
index_array_cPtr[ blockIdx.y ] = blockIdx.y; // that is a bad idea, should not work... choose another location in the source array or use a buffer array
id_array_cPtr[ blockIdx.y ] = T_s[0];
}
}
```

I did not test this code. It is just an idea. I used a similiar one to sum matrix rows. It works for me on a tesla and a quadro device, but failed with higher dimensions on a geforce 9800gt under openSuSE 11 (this problem is still unsolved, but i think it is a driver problem).

Propably this code is not very efficient. You can try to decide to use a buffer array to avoid finding the right ids after the first kernel run ( the results are located at the “beginning of a matrix line”.

The rewrite to global memory is wrong, as I can see at the moment. But I’m out of time, sorry ^^

Good Luck