Hello,

In my application I would like to find the K largest elements for every vector of length N in a collection of vectors being the input:

[font=“Courier New”] float d_input[8][8][32][1024];[/font]

For every vector of length 1024 I would like to find the 64 largest values and their indices as output of the algorithm:

[font=“Courier New”] typedef struct {

```
int index;
float value;
} element_t;
element_t d_output[8][8][32][64];[/font]
```

As a starting point, I thought it could be a good idea to use 64 threads per vector in a block to work in parallel to find these values:

[codebox]dim3 grid_reduction(8*8, 32);

reduction_device<<< grid_reduction, 64 >>>( d_data, d_reduced );

**global** void reduction_device( float *input, element_t *output )

{

```
int index = (blockIdx.y * gridDim.x + blockIdx.x )* 1024 + threadIdx.x;
input += index;
output += index;
```

**shared** element_t vector[1024];

//load vector from global memory

```
for( int i = 0; i < 16; i++ )
{
vector[i*64].index= i*64;
vector[i*64].value = input[i*64];
}
```

__syncthreads();

//search for 64 largest values

__syncthreads();

//result is in first 64 locations

```
*output = vector[threadIdx.x];
```

}[/codebox]

The big question now is the algorithm. I would appreciate some help, hints or pointers to papers.

Kind regards,

peter