Analyzing the speedup

wsm · September 1, 2017, 8:20pm

Hello everyone,

I have problems with analyzing the Speedup of a GPU implementation compared with a CPU implementation of an imageprocessing algorithm. For the same picture (1024*1024 pixel) in color and grayscale and for multiple filterwidth (size of the filter matrix), I have measured the executiontime 500 times and calculated the meantime. This I have done for multiple filters on the Jetson TK1.
In the attached picture is for example the mean-filter.
In both pictures, the x-axis is the filterwidth.
On the left side, the y-axis is the meantime and on the right the speedup.
The Speedup is calculated by CPU-meantime / GPU-meantime.

Now I have no idea, how to describe why the graph looks like it looks.
For example the color-image. Why is the highest speedup at filterwidth 7? And so on.

Can you give me a hint, what could influence the speedup?

Thanks in advance.

Here is the kernel:

#define R			9						// filter radius 
#define D			(R*2+1)					// filter diameter 
#define S			(D*D)					// filter size
#define BLOCK_W		32
#define BLOCK_H		32
#define TILE_W		(BLOCK_W - (2*R))
#define TILE_H		(BLOCK_H - (2*R))

__global__ void box_filter
(const unsigned char *in, unsigned char *out, const unsigned int width, const unsigned int high, const unsigned int channels){
	//Indexes
	const int x = blockIdx.x * TILE_W + threadIdx.x - R;		// x image index
	const int y = blockIdx.y * TILE_H + threadIdx.y - R;		// y image index
	const int d = (y * width + x) * channels + threadIdx.z;		// data index
	
	//shared mem (max 49152 bytes per block)
	__shared__ float shMem[BLOCK_W][BLOCK_H][3];
	if(x<0 || y<0 || x>=width || y>=high) {			// Threads which are not in the picture just write 0 to the shared mem
		shMem[threadIdx.x][threadIdx.y][threadIdx.z] = 0;
		return;	
	}
	shMem[threadIdx.x][threadIdx.y][threadIdx.z] = in[d];
	__syncthreads();
		
	// box filter (only for threads inside the tile)
	if ((threadIdx.x >= R) && (threadIdx.x < (BLOCK_W-R)) && (threadIdx.y >= R) && (threadIdx.y < (BLOCK_H-R))) {
		float sum = 0;
			for(int dx=-R; dx<=R; dx++) {
				for(int dy=-R; dy<=R; dy++) {
					sum += shMem[threadIdx.x+dx][threadIdx.y+dy][threadIdx.z];
				}
			}
		out[d] = sum / S;	
	}
}

EDIT:

I have adapt the BLOCK dimensions in order to maximize the occupancy with the occupancy-calculator. In the following table, D is the filterwidth and R the radius.

And here is the c-code from the sequential filterfunction:

#define R 			3						// filter radius 
#define D			(R*2+1)					// filter diameter 
#define S			(D*D)					// filter size

void box_filter
(const unsigned char *in, unsigned char *out, const unsigned int width, const unsigned int hight, const unsigned int channels, int R){
	int dx, dy, nx, ny;
	unsigned int x=0, y=0, d, nidx, len=width*hight*channels;
	// go over dataarray
	for(d=0;d<len;d++){
		float sum=0;
		// go over neighbors
		for(dx=-R;dx<=R;dx++){
			for(dy=-R;dy<=R;dy++){
				nx=x+dx;
				ny=y+dy;
				// check, if neighbor is in picture
				if(nx>=0 && nx<width && ny>=0 && ny<hight){
					nidx = d + (dy*width + dx) * channels;
					sum += in[nidx];
				}
				
			}
		}
		
		out[d]=sum/S;
		
		// calculate next x and y
		if((d+1)%channels==0)	x++;
		if(x==width){
			x=0;
			y++;
		}
	}
}

Robert_Crovella · September 1, 2017, 9:38pm

The cost to load your 32x32 shared mem array is constant per block or per tile.

If your filterwidth is very small, you get a lot of threads within your tile that can take advantage of shared memory load, but the benefit of reuse is small (the number of times each item is reused is small) because the filterwidth is small.

If your filterwidth is very large, then the tile size drops, so the number of threads per tile that benefit from the loading of the 32x32 shared mem array is smaller and smaller.

I think it makes sense there would be a sweet spot in the middle.

If your x-axis is your filter radius, then your datapoint at R=17 is curious. R=17 would give a negative tile size:

#define BLOCK_W		32
#define BLOCK_H		32
#define TILE_W		(BLOCK_W - (2*R))

If you seriously want help, my suggestion would be to provide a complete code that others can test. Providing the kernel without the CPU reference you are comparing to is only half the story (or less).

wsm · September 2, 2017, 11:24am

I have updated my question

wsm · October 16, 2017, 6:21pm

Hello,
now I have another Question about the same Kernel. In many of my Implementation the deviation is growing with the Filter size. For a explanation I have thought about a running-example:
In a marathon, the deviation is grater than in a 20m sprint and higher filter-size takes more time.
But how can I write that in a more scientific way? I have searched for explanations in papers or books, but found nothing.

Here are the histograms of the box-filter:

On the x-Axis is time in ms and on the y-axis the Number of values in an Interval of 5ms.
In the headlines are the filter-size, the standard deviation and the occupancy calculated with the occupancy-calculator. Not only filters with lower occupancy are higher deviated.

best regards

njuffa · October 16, 2017, 8:04pm

What happens if you express the run-time deviation as a percentage of the run time, instead of in absolute numbers?

If you run a sufficient number of experiments, that would presumably result in almost identically looking Gaussian distributions, in which case, no further explanation necessary, correct? The total spread seems to be roughly +/-25% in all the cases, except possibly for the very last one. Scaling the diagram’s x-axis limits in strict proportion to the average run time should get you the same effect.

cbuchner1 · October 17, 2017, 8:45am

Also careful with the high speed-ups. This is a against a single threaded, non vectorized CPU implementation. If you used all 4 cores with 128 bit SIMD NEON instructions to implement the CPU filter, the picture might look a lot different.

wsm · October 25, 2017, 8:31am

The standard deviation and percentual deviation dont gave me any useful information, so I decided to remove it from the figures. In many of my plots, the time distributions are much wider at higher filtersizes and I am looking for a reason or explanation for that. An example histogramplot is below. The time-intervals are always 5ms. The red line is the mean and the green one the median of the measurements.

Topic		Replies	Views
Image processing with CUDA: design question. CUDA Programming and Performance	5	1018	January 26, 2018
Please criticize code (Gaussian filtering) CUDA Programming and Performance	4	1000	March 14, 2016
Kernel Konfiguration and Runtime CUDA Programming and Performance	6	2455	March 6, 2010
How to use more efficiently the shared memory and 2D tiles CUDA Programming and Performance	6	1695	May 18, 2023
shared memory performance kernel execution timings with one block CUDA Programming and Performance	3	3168	May 6, 2007
[RayTracing] Smaller resolution - smaller rays per sec (performance) CUDA Programming and Performance	13	2119	October 30, 2013
Question on --use_fast_math speedup and register usage CUDA Programming and Performance	12	3989	October 18, 2013
Confusion about setting kernel block and grid size for maximum occupancy CUDA Programming and Performance cuda	11	817	March 30, 2024
Converting a kernel from floats and ints to halfs is 6x slower CUDA Programming and Performance cuda	14	1032	October 16, 2023
Extremely slow smem reads Possible bank conflicts? CUDA Programming and Performance	11	3482	November 25, 2008

Analyzing the speedup

Related topics