Hi,

I need to compute the median of a very large vector and I would like to use shared memory. Any ideas for an efficient way to split the operation?

Thanks in advance.

I don't know what a large vector median is,

but I think you can find good material here: http://gpgpu.org/static/sc2007/SC07_CUDA_5_Optimization_Harris.pdf

Sorry, my topic was not clear enough.

I’ll try to explain my problem better:

I have a num_r x num_c matrix and I need to calculate the median of its columns. I would like to use shared memory, but if the matrix dimensions are bigger than the maximum number of threads per block, I need to split the operation, and I've decided to keep the block dimensions fixed in order to achieve better performance. I did the same thing for calculating the mean value of each column. This is my code:

```
#define F_BLOCK_SIZE 32

...

num_r = Wr*Wc;

if ((num_r % F_BLOCK_SIZE) != 0)
    a = (num_r / F_BLOCK_SIZE) + 1;
else
    a = num_r / F_BLOCK_SIZE;

...

// compute execution configuration
dimBlock.y = F_BLOCK_SIZE;
dimBlock.x = 1;
dimGrid.x  = a;
dimGrid.y  = num_c;

cudaEventRecord(start, 0);

compute_sub_mean<<<dimGrid,dimBlock>>>(d_matrix, d_sub_mean, num_c, pos, Wr, Wc);

dimBlock.x = 1;
dimBlock.y = F_BLOCK_SIZE;
dimGrid.y  = num_c / F_BLOCK_SIZE;
dimGrid.x  = 1;
if ((Nroi % F_BLOCK_SIZE) != 0)
    dimGrid.y = dimGrid.y + 1;

compute_mean<<<dimGrid,dimBlock>>>(d_sub_mean, d_feat, d_mean, a, Wr, Wc, num_c, pos, tot_feat);
......
```

/*Kernel to compute sub mean*/

```
__global__ void compute_sub_mean(short *in_w, float *sub_mean, int Nr, int pos, int wr, int wc)
{
    int ty = threadIdx.y;
    int bx = blockIdx.x;
    int by = blockIdx.y;

    int win_area = wr * wc;
    int index = (ty + bx*blockDim.y) + by*win_area;

    __shared__ float partial_wind_s[SH_SIZE];

    // init shared memory
    partial_wind_s[ty] = 0;

    // copy window into shared memory
    if ((ty + bx*blockDim.y) < win_area)
        partial_wind_s[ty] = in_w[index];
    __syncthreads();

    // tree reduction of the partial sum in shared memory
    for (int stride = blockDim.y/2; stride >= 1; stride >>= 1)
    {
        if (ty < stride)
            partial_wind_s[ty] += partial_wind_s[ty + stride];
        __syncthreads();
    }

    // write the partial sum to global memory (only thread 0 needs to write)
    if (ty == 0)
        sub_mean[bx*Nr + by] = partial_wind_s[0];
}
```

/*Kernel to compute_mean*/

```
__global__ void compute_mean(float *s_mean, float *feat, float *me, int num_sub_mean, int wr, int wc, int Nr, int pos, int tot_feat)
{
    int ty = threadIdx.y;
    int by = blockIdx.y;
    float sum = 0;
    float win_area = wr * wc;

    // calculate the mean from the partial sums (global-to-global)
    for (int i = 0; i < num_sub_mean; i++)
        sum += s_mean[ty + by*blockDim.y + i*Nr];

    // write results to global memory (both in the features matrix and in the means vector)
    if ((ty + by*blockDim.y) < Nr)
    {
        feat[pos + (ty + by*blockDim.y)*tot_feat] = sum/win_area;
        me[ty + by*blockDim.y] = sum/win_area;
    }
}
```

Can I do the same thing for the median? I know that I need to use a sorting algorithm, and I've chosen merge sort.

Any ideas?

Hi,

If your big matrix fits in global memory:

If you have many reductions to perform, you will get more parallelism by doing them concurrently:

```
__global__ void means(float *BigMatriceIn, float *LongVectorOut, sizes ...)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;

    float sum = 0;
    // each thread sums one column of the matrix
    for (int i = 0; i < LongSize; i++)
    {
        sum += BigMatriceIn[i*LongSize + tid];
    }
    LongVectorOut[tid] = sum;   // divide by the column length to get the mean
}
```

This is better than launching many separate reductions (which suffer from lower occupancy).

If the vector is too long (> 65535 x 1024) you can:

- do more computation per thread
- or call the kernel several times
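A common way to do more computation per thread is a grid-stride loop, so one kernel launch handles any vector length without hitting the grid-dimension limit. This is only a minimal sketch under assumed names (`column_sums`, a row-major `num_rows` x `num_cols` matrix); none of these identifiers come from the code above.

```cuda
// Grid-stride sketch (illustrative names, not from the original code):
// each thread computes the mean of one or more columns of a
// row-major num_rows x num_cols matrix.
__global__ void column_sums(const float *mat, float *out,
                            int num_rows, int num_cols)
{
    for (int col = threadIdx.x + blockIdx.x * blockDim.x;
         col < num_cols;
         col += blockDim.x * gridDim.x)   // stride over the whole grid
    {
        float sum = 0.0f;
        for (int row = 0; row < num_rows; row++)
            sum += mat[row * num_cols + col];
        out[col] = sum / num_rows;        // column mean
    }
}
```

With this pattern the grid size can stay fixed (e.g. a few times the number of multiprocessors) regardless of how many columns there are.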

For the median it will be very difficult if your matrix doesn't fit in global memory,

and if it does fit, each thread has to do more computation.
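As a starting point, and assuming each column is short enough to fit in a per-thread local array, one thread can select its column's median directly. This is a hedged sketch with illustrative names (`column_medians`, `MAX_ROWS`), not the merge-sort approach mentioned above; for long columns a proper parallel sort per column would be needed.

```cuda
#define MAX_ROWS 256   // assumed upper bound on the column length

// Naive per-thread median (illustrative): each thread copies its
// column into local memory, insertion-sorts it, and picks the
// middle element.
__global__ void column_medians(const short *mat, float *med,
                               int num_rows, int num_cols)
{
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    if (col >= num_cols || num_rows > MAX_ROWS) return;

    short v[MAX_ROWS];
    for (int r = 0; r < num_rows; r++)
        v[r] = mat[r * num_cols + col];   // gather the column

    // insertion sort of the local copy
    for (int i = 1; i < num_rows; i++) {
        short key = v[i];
        int j = i - 1;
        while (j >= 0 && v[j] > key) { v[j + 1] = v[j]; j--; }
        v[j + 1] = key;
    }

    // middle element, or mean of the two middle elements
    med[col] = (num_rows & 1) ? (float)v[num_rows / 2]
             : 0.5f * (v[num_rows / 2 - 1] + v[num_rows / 2]);
}
```

The local array lives in per-thread local memory, so this trades memory bandwidth for simplicity; a shared-memory merge sort per column would scale better.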