Performance Questions

I’ve been getting disappointing performance with this kernel. It takes around 2.3 seconds to run 256 times on an 8800GT (one launch per tile, for 256 separate tiles, each contributing a piece of the final output; I broke the work into multiple kernel calls so I wouldn’t have a single loop 256×256 iterations long, and it also lets me overlap my memory copies with kernel execution), and I’m not sure what (if anything) is holding me back. The profiler doesn’t work on my Mac, which makes it more difficult to analyze my code. Am I making any obvious performance errors? Suggestions?

Thanks! :(

#include <cufft.h>
#include <cuComplex.h>

#define BLOCK_SIZE 256
#define TILE_SIZE 256
#define TWOPI 6.283185307179586f  // 2*pi; this definition was not shown in the original post

__device__ __constant__ float y_const[2500];
__device__ __constant__ float fx_const[2500];
__device__ __constant__ float fy_const[2500];
__device__ __constant__ float dy_const[2500];

//runs 256 times with a loop that iterates 256 times
//f points to a different tile of the input for each kernel call
__global__ void transform(cufftComplex* output, cufftComplex* f, unsigned int fM,
							unsigned int fN, unsigned int fxN, unsigned int fyN, unsigned int start,
							float* x_g, float* dx_g)
{
	unsigned int tx = threadIdx.x;
	unsigned int bx = blockIdx.x;

	unsigned int elem = tx + (BLOCK_SIZE * bx);
	unsigned int ifx = elem % fxN;
	unsigned int ify = elem / fxN;

	float fx = fx_const[ifx];
	float fy = fy_const[ify];

	__shared__ cufftComplex fTile[TILE_SIZE];
	fTile[tx] = f[tx];

	unsigned int ix, iy;
	float x, y, dx, dy, m, n, a;
	cufftComplex tot = make_cuFloatComplex(0, 0);
	cufftComplex med;
	ix = start % fM;
	iy = start / fM;

	y = y_const[iy];
	dy = dy_const[iy];

	__shared__ float dx_s[TILE_SIZE];
	__shared__ float x_s[TILE_SIZE];

	dx_s[tx] = dx_g[ix + tx];
	x_s[tx] = x_g[ix + tx];
	__syncthreads();  // make the shared tiles visible to all threads before the loop

	a = y * fy;
	#pragma unroll
	for (unsigned int counter = 0; counter < TILE_SIZE; counter++) {
		//lots of calculations
		x = x_s[counter];
		dx = dx_s[counter];
		m = -TWOPI * ((x * fx) + a);
		n = dx * dy;
		med = cuCmulf(fTile[counter], make_cuFloatComplex(n * __cosf(m), n * __sinf(m)));
		tot = cuCaddf(tot, med);
	}

	if (start == 0)
		output[elem] = tot;
	else
		output[elem] = cuCaddf(output[elem], tot);
}
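For reference, the host-side overlap I mentioned looks roughly like this (a simplified sketch, not my exact code: the names h_f, d_f, d_output, d_x, d_dx, the GRID_SIZE constant, and the way start advances are all illustrative):

```cuda
// Sketch of overlapping tile uploads with kernel execution using two
// streams. The host buffer h_f must be pinned (cudaMallocHost) for
// cudaMemcpyAsync to actually overlap with kernels; d_f[0]/d_f[1] are
// two device-side tile buffers used in ping-pong fashion.
cudaStream_t stream[2];
for (int s = 0; s < 2; ++s)
    cudaStreamCreate(&stream[s]);

for (unsigned int tile = 0; tile < 256; ++tile) {
    int s = tile & 1;  // alternate between the two streams
    // Copy the next input tile while the previous tile's kernel runs.
    cudaMemcpyAsync(d_f[s], h_f + tile * TILE_SIZE,
                    TILE_SIZE * sizeof(cufftComplex),
                    cudaMemcpyHostToDevice, stream[s]);
    transform<<<GRID_SIZE, BLOCK_SIZE, 0, stream[s]>>>(
        d_output, d_f[s], fM, fN, fxN, fyN,
        tile * TILE_SIZE,  // how 'start' advances is an assumption here
        d_x, d_dx);
}
cudaThreadSynchronize();  // wait for all 256 tiles to finish
```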


How much time does this kernel take (GPU time in the CUDA profiler)?
I saw that you use the “%” operation; it is very slow, so try not to use it.
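For example, since fxN is a power of two in the posted kernel (256), the div/mod pair at the top can become bit operations:

```cuda
// Valid only when fxN is a power of two (it is 256 in the posted code).
unsigned int ifx = elem & (fxN - 1);  // same result as elem % fxN
unsigned int ify = elem >> 8;         // same result as elem / fxN when fxN == 256
```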
In my experience, I never start with constant memory; I use global memory first so that the CUDA profiler reports the gld_incoherent value, and then I try to work out how to reduce gld_incoherent by using shared memory. If I can’t, I use global memory or texture memory.

Could you please try using global memory for y_const, fx_const, fy_const, and dy_const, and then copying it to shared memory without gld_incoherent loads? If that doesn’t work, bind it to a texture.

It seems to me that the iterations of the for-loop are independent of each other, except for ‘tot’. If so, do these calculations in parallel, store the per-thread results in a shared memory array, and use the reduction algorithm to sum them into ‘tot’.
The code is also quite difficult to read; please use more descriptive names instead of x, dx, etc.

Thanks for the quick replies!

The profiler doesn’t work on my Mac if I use any counters or other options, but I did time it with CUDA events. Each kernel usually takes around 0.0088 seconds, so kernel launch overhead isn’t too much of a problem. This is a slightly simplified version; my real code uses bit shifting instead of %, but since I only use it twice, the performance gain is small. I have also tried constant memory vs. global memory, and there isn’t much of a difference, unfortunately. I’ll look into texture memory, though. I’m not able to look at the gld_incoherent value, but all memory accesses should be coalesced where possible (I could be wrong…).

I considered such an approach, but I figured all the extra accesses to global memory would make it not worth it. To get each element of the 256x256 output array, I have to perform calculations on every element of a 256x256 input array. (Right now, each thread computes a piece of a single output element using a small tile of the input, and each successive thread in a block handles a different output element. I did it this way so threads could share an input tile copied into shared memory.) In the end, I would have even more redundant memory accesses than I already have, and I would still end up with some reduction over global memory :( . It does seem like something to try, though. At this point, I’m not quite sure what is slowing my program down (any ideas?). I hope I understood you correctly.


Parallel access to global memory should, in principle, be much faster than sequential access. So instead of assigning one thread to each output element, try using many threads for one output element (e.g. one block per element).
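Concretely, a block-per-element version might look like the sketch below (illustrative only, not a drop-in replacement: y and dy are passed as arguments here, f/x_g/dx_g are assumed to already point at the current tile, and the other names mirror the posted kernel). Each of the 256x256 blocks computes one output element; its 256 threads each compute one term of the sum with coalesced loads, and a shared-memory tree reduction combines the terms:

```cuda
// One block per output element: launch 256*256 blocks of TILE_SIZE (256)
// threads. TILE_SIZE, TWOPI, and fx_const/fy_const are as in the posted
// code; y and dy are passed in per tile for simplicity.
__global__ void transform_per_element(cufftComplex* output,
                                      const cufftComplex* f,
                                      unsigned int fxN,
                                      float y, float dy,
                                      const float* x_g, const float* dx_g)
{
    unsigned int elem = blockIdx.x;   // which output element this block computes
    unsigned int tx   = threadIdx.x;  // which term of the 256-term sum

    float fx = fx_const[elem % fxN];
    float fy = fy_const[elem / fxN];

    // Consecutive threads read consecutive addresses, so these global
    // loads are coalesced.
    float x  = x_g[tx];
    float dx = dx_g[tx];
    float m  = -TWOPI * ((x * fx) + (y * fy));
    float n  = dx * dy;

    __shared__ cufftComplex partial[TILE_SIZE];
    partial[tx] = cuCmulf(f[tx], make_cuFloatComplex(n * __cosf(m), n * __sinf(m)));
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (unsigned int s = TILE_SIZE / 2; s > 0; s >>= 1) {
        if (tx < s)
            partial[tx] = cuCaddf(partial[tx], partial[tx + s]);
        __syncthreads();
    }

    // Accumulate this tile's contribution, as the original kernel does.
    if (tx == 0)
        output[elem] = cuCaddf(output[elem], partial[0]);
}
```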