Analyzing the speedup

Hello everyone,

I'm having trouble analyzing the speedup of a GPU implementation of an image-processing algorithm compared with a CPU implementation. For the same picture (1024x1024 pixels), in color and grayscale, and for multiple filter widths (the size of the filter matrix), I measured the execution time 500 times and calculated the mean time. I did this for multiple filters on the Jetson TK1.
The attached picture shows the mean filter as an example.
In both plots, the x-axis is the filter width.
On the left, the y-axis is the mean time; on the right, the speedup.
The speedup is calculated as CPU mean time / GPU mean time.

Now I have no idea how to explain why the graph looks the way it does.
Take the color image, for example: why is the highest speedup at filter width 7? And so on.

Can you give me a hint about what could influence the speedup?

Thanks in advance.

Here is the kernel:

#define R			9						// filter radius 
#define D			(R*2+1)					// filter diameter 
#define S			(D*D)					// filter size
#define BLOCK_W		32
#define BLOCK_H		32
#define TILE_W		(BLOCK_W - (2*R))
#define TILE_H		(BLOCK_H - (2*R))

__global__ void box_filter
(const unsigned char *in, unsigned char *out, const unsigned int width, const unsigned int height, const unsigned int channels){
	const int x = blockIdx.x * TILE_W + threadIdx.x - R;		// x image index
	const int y = blockIdx.y * TILE_H + threadIdx.y - R;		// y image index
	const int d = (y * width + x) * channels + threadIdx.z;		// data index
	// shared mem (max 49152 bytes per block)
	__shared__ float shMem[BLOCK_W][BLOCK_H][3];
	if(x<0 || y<0 || x>=width || y>=height) {	// threads outside the picture just write 0 to shared mem
		shMem[threadIdx.x][threadIdx.y][threadIdx.z] = 0;
	} else {
		shMem[threadIdx.x][threadIdx.y][threadIdx.z] = in[d];
	}
	__syncthreads();	// all loads must finish before any thread reads its neighbors
	// box filter (only for threads inside the tile, and inside the image)
	if ((threadIdx.x >= R) && (threadIdx.x < (BLOCK_W-R)) && (threadIdx.y >= R) && (threadIdx.y < (BLOCK_H-R))
			&& (x < width) && (y < height)) {
		float sum = 0;
		for(int dx=-R; dx<=R; dx++) {
			for(int dy=-R; dy<=R; dy++) {
				sum += shMem[threadIdx.x+dx][threadIdx.y+dy][threadIdx.z];
			}
		}
		out[d] = sum / S;
	}
}


I adapted the block dimensions to maximize occupancy using the occupancy calculator. In the following table, D is the filter width and R the radius.

And here is the C code of the sequential filter function:

#define R 			3						// filter radius 
#define D			(R*2+1)					// filter diameter 
#define S			(D*D)					// filter size

void box_filter
(const unsigned char *in, unsigned char *out, const unsigned int width, const unsigned int height, const unsigned int channels){
	int dx, dy, nx, ny;
	unsigned int x=0, y=0, d, nidx, len=width*height*channels;
	// go over data array
	for(d=0; d<len; d++){
		float sum=0;
		// go over neighbors
		for(dy=-R; dy<=R; dy++){
			for(dx=-R; dx<=R; dx++){
				nx = x+dx;
				ny = y+dy;
				// check if neighbor is in picture
				if(nx>=0 && nx<width && ny>=0 && ny<height){
					nidx = d + (dy*width + dx) * channels;
					sum += in[nidx];
				}
			}
		}
		out[d] = sum / S;
		// calculate next x and y
		if((d+1)%channels==0)	x++;
		if(x==width){ x=0; y++; }
	}
}

The cost to load your 32x32 shared memory array is constant per block, i.e., per tile.

If your filter width is very small, many threads within your tile can take advantage of the shared memory load, but the benefit of reuse is small (each item is reused only a few times) because the filter width is small.

If your filter width is very large, then the tile size drops, so fewer and fewer threads per tile benefit from loading the 32x32 shared memory array.

I think it makes sense that there would be a sweet spot in the middle.

If your x-axis is your filter radius, then your data point at R=17 is curious: R=17 would give a negative tile size:

#define BLOCK_W		32
#define BLOCK_H		32
#define TILE_W		(BLOCK_W - (2*R))

If you seriously want help, my suggestion would be to provide complete code that others can test. Providing the kernel without the CPU reference you are comparing against is only half the story (or less).

I have updated my question.

Now I have another question about the same kernel. In many of my implementations, the deviation grows with the filter size. As an explanation, I thought of a running analogy:
in a marathon, the deviation is greater than in a 20 m sprint, and a larger filter size takes more time.
But how can I phrase that in a more scientific way? I have searched for explanations in papers and books, but found nothing.

Here are the histograms of the box-filter:

On the x-axis is the time in ms, and on the y-axis the number of values in an interval of 5 ms.
The headlines show the filter size, the standard deviation, and the occupancy calculated with the occupancy calculator. It is not only the filters with lower occupancy that show higher deviation.

Best regards

What happens if you express the run-time deviation as a percentage of the run time instead of in absolute numbers?

If you run a sufficient number of experiments, that would presumably result in almost identical-looking Gaussian distributions, in which case no further explanation is necessary, correct? The total spread seems to be roughly +/-25% in all cases, except possibly the very last one. Scaling the diagram's x-axis limits in strict proportion to the average run time should have the same effect.

Also, be careful with the high speedups. This is against a single-threaded, non-vectorized CPU implementation. If you used all four cores with 128-bit SIMD NEON instructions to implement the CPU filter, the picture might look a lot different.
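For illustration, a fairer CPU baseline could at least use all four Cortex-A15 cores. A minimal sketch of the sequential filter parallelized over rows with OpenMP (the NEON vectorization mentioned above would be additional work on top of this):

```c
#define R 3
#define D (R*2+1)
#define S (D*D)

/* Box filter with the outer row loop parallelized across CPU cores.
 * Compile with -fopenmp; without it the pragma is ignored and the
 * code runs sequentially, producing the same result. */
void box_filter_mt(const unsigned char *in, unsigned char *out,
                   int width, int height, int channels) {
    #pragma omp parallel for
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            for (int c = 0; c < channels; c++) {
                float sum = 0;
                for (int dy = -R; dy <= R; dy++) {
                    for (int dx = -R; dx <= R; dx++) {
                        int nx = x + dx, ny = y + dy;
                        /* out-of-picture neighbors contribute 0 */
                        if (nx >= 0 && nx < width && ny >= 0 && ny < height)
                            sum += in[(ny * width + nx) * channels + c];
                    }
                }
                out[(y * width + x) * channels + c] = sum / S;
            }
        }
    }
}
```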

The standard deviation and the percentage deviation didn't give me any useful information, so I decided to remove them from the figures. In many of my plots, the time distributions are much wider at larger filter sizes, and I am looking for a reason or explanation for that. An example histogram plot is below. The time intervals are always 5 ms. The red line is the mean and the green one the median of the measurements.