Hello everyone,

I am having trouble analyzing the speedup of a GPU implementation of an image-processing algorithm compared with a CPU implementation. For the same picture (1024 x 1024 pixels), in color and in grayscale, and for several filter widths (the size of the filter matrix), I measured the execution time 500 times and calculated the mean time. I did this for several filters on the Jetson TK1.

The attached picture shows the mean filter as an example.

In both plots, the x-axis is the filter width.

In the left plot, the y-axis is the mean time; in the right plot, it is the speedup.

The speedup is calculated as CPU mean time / GPU mean time.
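To make the calculation concrete, here is a minimal sketch in plain C of how the mean time and the speedup are derived from the raw samples. The function names `mean_time` and `speedup` are only illustrative and are not taken from my measurement code; the sketch assumes both sample arrays have the same length (500 in my case) and use the same time unit.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n timing samples (any time unit). */
double mean_time(const double *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}

/* Speedup as defined above: mean CPU time divided by mean GPU time. */
double speedup(const double *cpu, const double *gpu, size_t n)
{
    return mean_time(cpu, n) / mean_time(gpu, n);
}
```

A value above 1 means the GPU version is faster on average for that filter width.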

Now I have no idea how to explain why the graphs look the way they do.

Take the color image, for example: why is the highest speedup at filter width 7? And so on.

Can you give me a hint about what could influence the speedup?

Thanks in advance.

Here is the kernel:

```
#define R 9                      // filter radius
#define D (R*2+1)                // filter diameter
#define S (D*D)                  // filter size (number of filter taps)
#define BLOCK_W 32
#define BLOCK_H 32
#define TILE_W (BLOCK_W - (2*R)) // output pixels per block in x
#define TILE_H (BLOCK_H - (2*R)) // output pixels per block in y

__global__ void box_filter(const unsigned char *in, unsigned char *out,
                           const unsigned int width, const unsigned int height,
                           const unsigned int channels)
{
    // Indexes (each block also loads a border of R pixels around its tile)
    const int x = blockIdx.x * TILE_W + threadIdx.x - R;    // x image index
    const int y = blockIdx.y * TILE_H + threadIdx.y - R;    // y image index
    const int d = (y * width + x) * channels + threadIdx.z; // data index (threadIdx.z = channel)
    const bool inside = (x >= 0 && y >= 0 && x < (int)width && y < (int)height);

    // shared mem (max 49152 bytes per block)
    __shared__ float shMem[BLOCK_W][BLOCK_H][3];

    // Threads which are not in the picture just write 0 to the shared mem.
    // They must not return here: every thread of the block has to reach
    // __syncthreads(), otherwise the behavior is undefined.
    shMem[threadIdx.x][threadIdx.y][threadIdx.z] = inside ? in[d] : 0.0f;
    __syncthreads();

    // box filter (only for threads inside the tile and inside the picture)
    if (inside &&
        (threadIdx.x >= R) && (threadIdx.x < (BLOCK_W-R)) &&
        (threadIdx.y >= R) && (threadIdx.y < (BLOCK_H-R))) {
        float sum = 0;
        for (int dx = -R; dx <= R; dx++) {
            for (int dy = -R; dy <= R; dy++) {
                sum += shMem[threadIdx.x+dx][threadIdx.y+dy][threadIdx.z];
            }
        }
        out[d] = sum / S;
    }
}
```
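As a side note on the macros in the kernel above: because every block also loads a border of R pixels, only the threads inside the TILE_W x TILE_H area produce an output pixel. The following host-side sketch in plain C computes that fraction for a given block size and radius. The helper names `tile_side` and `active_fraction` are hypothetical and not part of the measured code; it only restates the `TILE_W = BLOCK_W - 2*R` arithmetic.

```c
#include <assert.h>

/* Output pixels per block side, mirroring TILE_W = BLOCK_W - 2*R. */
int tile_side(int block_side, int radius)
{
    assert(radius >= 0 && block_side > 2 * radius);
    return block_side - 2 * radius;
}

/* Fraction of the threads in one block_w x block_h layer of a block
   that compute an output pixel (the rest only load border pixels). */
double active_fraction(int block_w, int block_h, int radius)
{
    int tw = tile_side(block_w, radius);
    int th = tile_side(block_h, radius);
    return (double)(tw * th) / (double)(block_w * block_h);
}
```

For example, with a 32 x 32 block and R = 9 the tile is 14 x 14, so about 19% of the threads compute an output pixel; with R = 3 it is about 66%.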

**EDIT:**

I have adapted the block dimensions to maximize the occupancy, using the occupancy calculator. In the following table, D is the filter width and R the radius.

And here is the C code of the sequential filter function:

```
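#define R 3       // filter radius
#define D (R*2+1) // filter diameter
#define S (D*D)   // filter size

// R comes from the macro above. A parameter named R would be replaced by
// the preprocessor (int R -> int 3) and would not compile.
void box_filter(const unsigned char *in, unsigned char *out,
                const unsigned int width, const unsigned int height,
                const unsigned int channels)
{
    int dx, dy, nx, ny;
    unsigned int x = 0, y = 0, d, nidx, len = width * height * channels;
    // go over the data array
    for (d = 0; d < len; d++) {
        float sum = 0;
        // go over the neighbors
        for (dx = -R; dx <= R; dx++) {
            for (dy = -R; dy <= R; dy++) {
                nx = (int)x + dx;
                ny = (int)y + dy;
                // check if the neighbor is in the picture
                // (neighbors outside are skipped, i.e. they count as 0)
                if (nx >= 0 && nx < (int)width && ny >= 0 && ny < (int)height) {
                    // offset computed in signed arithmetic, since dx and dy can be negative
                    nidx = (unsigned int)((int)d + (dy * (int)width + dx) * (int)channels);
                    sum += in[nidx];
                }
            }
        }
        out[d] = sum / S;
        // calculate next x and y
        if ((d + 1) % channels == 0) x++;
        if (x == width) {
            x = 0;
            y++;
        }
    }
}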
```