Parallel median computation

francy300485 · February 11, 2011, 11:42am

Hi,
I need to compute a very large vector median and I would to use shared memory. any idea for an efficient operation splitting?
Thanks in advance.

cricri1 · February 11, 2011, 1:23pm

i dont know what is a large vector median
but i think you can find good thing here http://gpgpu.org/static/sc2007/SC07_CUDA_5_Optimization_Harris.pdf

francy300485 · February 11, 2011, 2:01pm

Sorry, my topic was not clear enough.

I’ll try to explain my problem better:

I have a matrix num_rxnum_c and I need to calculate the median of its columns. I would to use the shared memory, but if matrix dimensions are bigger than the maximum number of thread x block, I need to split the operation, and I’ve decided to keep block dimensions fixed in order to achieve better performance. I’ve done this also for calculating the mean value for each column. This This is my code

define F_BLOCK_SIZE 32

…

num_r= Wr*Wc

if((num_r%F_BLOCK_SIZE)!=0)

a=((num_r)/F_BLOCK_SIZE)+1;

else

a=((num_r)/F_BLOCK_SIZE);

…

//compute Execution Configuration

dimBlock.y=F_BLOCK_SIZE;

dimBlock.x=1;

dimGrid.x=a;

dimGrid.y=num_c;

cudaEventRecord(start, 0);

compute_sub_mean<<<dimGrid,dimBlock>>>(d_matrix,d_sub_mean,num_c,pos,Wr,Wc);

dimBlock.x=1;

dimBlock.y=F_BLOCK_SIZE;

dimGrid.y=num_c/F_BLOCK_SIZE;

dimGrid.x=1;

if((Nroi%F_BLOCK_SIZE)!=0)

dimGrid.y=dimGrid.y+1;

compute_mean<<<dimGrid,dimBlock>>>(d_sub_mean,d_feat,d_mean,a,Wr,Wc,num_c,pos,tot_feat);

......

/Kernel to compute sub mean/

global void compute_sub_mean(short *in_w,float *sub_mean,int Nr,int pos,int wr,int wc)

{

int ty=threadIdx.y;

int bx=blockIdx.x;

int by=blockIdx.y;

float win_area=(wr*wc);

int index=(ty+bxblockDim.y)+bywin_area;

shared float partial_wind_s[SH_SIZE];

//init shared memory

partial_wind_s[ty]=0;

//Copy windows in shared memory

if(((ty+bx*blockDim.y))<win_area)

partial_wind_s[ty]=in_w[index];

__syncthreads();

//calculate partial mean in shared memory

for(int stride=blockDim.y/2; stride>=1; stride>>=1)

{

if(ty<stride)

partial_wind_s[ty]+=partial_wind_s[ty+stride];

__syncthreads();

}

//write partial mean in global memory

sub_mean[bx*Nr+by]=partial_wind_s[0];

}

/Kernel to compute_mean/

global void compute_mean(float *s_mean,float *feat,float *me,int num_sub_mean,int wr,int wc,int Nr,int pos,int tot_feat)

{ int ty=threadIdx.y;

int by=blockIdx.y;

float sum=0;

float win_area=wr*wc;

//claculate mean from partial mean (global-to-global)

for(int i=0;i<num_sub_mean; i++)

 {sum+=s_mean[ty+by*blockDim.y+i*Nr];

 }

//write results in global memory (both in features matrix and in means vector)

if((ty+by*blockDim.y)<Nr)

{ feat[pos+(ty+by*blockDim.y)*tot_feat]=sum/win_area;

 me[(ty+by*blockDim.y)]=sum/win_area;

}

Can I do the same thing for the median? I know that I need to use a sorting algorithm, and I’ve chose merge_sort.

Any Idea?

varslan · February 11, 2011, 5:00pm

Hi,

If your big matrice fits in global memory:
If you have many reductions to perforom you will have more parallelism doing it concurently:

global void means(float *BigMatriceIn, float LongVectorOut, sizes …)
{
tid=threadIdx.x+blockIdx.xblockDim.x;

float sum=0;
for(int i=0;i<LongSize;i++{
sum+=BigMatriceIn[i*LongSize+tid];
}
LongVectorOut[tid]=sum;
}
This is better than doing many reduction (lower occupancy).

If the size of the vector is too long ( > 65535 x 1024) you can

make more computation / thread
or call many time the kernel

For the median it will be very difficult if your matrice doesn t fit in global mermory,
and if it fits you have to do more computation by threads.