Hello,
In my application I would like to find the K largest elements for every vector of length N in a collection of vectors being the input:
[font=“Courier New”] float d_input[8][8][32][1024];[/font]
For every vector of length 1024 I would like to find the 64 largest values and their indices as output of the algorithm:
[font=“Courier New”] typedef struct {
int index;
float value;
} element_t;
element_t d_output[8][8][32][64];[/font]
As a starting point, I thought it could be a good idea to use 64 threads per vector in a block to work in parallel to find these values:
[codebox]dim3 grid_reduction(8*8, 32);
reduction_device<<< grid_reduction, 64 >>>( d_data, d_reduced );
global void reduction_device( float *input, element_t *output )
{
int index = (blockIdx.y * gridDim.x + blockIdx.x )* 1024 + threadIdx.x;
input += index;
output += index;
shared element_t vector[1024];
//load vector from global memory
for( int i = 0; i < 16; i++ )
{
vector[i*64].index= i*64;
vector[i*64].value = input[i*64];
}
__syncthreads();
//search for 64 largest values
__syncthreads();
//result is in first 64 locations
*output = vector[threadIdx.x];
}[/codebox]
The big question now is the algorithm. I would appreciate some help, hints or pointers to papers.
Kind regards,
peter