Optimisation Tips for GK110

I got a nice shiny new Titan this week, and after a bit of profiling I have noticed an overall speedup in my program (yay!), but how that speedup was split among my kernels was a bit haphazard. Some kernels gained a massive 2x speedup, yet others had a significant speed decrease. Is there a list anywhere of optimisation tips and tricks for ‘Big Kepler’, or is it a matter of trawling through trying to find the slow bits?

My worst offender went from about 270us to 460us, so there is obviously something strange there. It is not doing anything particularly unusual: a chunk of memory access, various simple maths operations (muls, adds, subs) and that's it. There are likely block/thread values that give better performance, but such a drop on a card that is significantly newer than the old one (a GTX 580) was a bit of a surprise.

Try increasing your block size, or the overall number of blocks. When I went from the GTX 580 to the GTX 680, I had several programs that ran significantly worse because the optimal block configuration for the 580 was not good for the 680. The new SMX design with 192 CUDA cores wants a lot more threads to reach full utilization.
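One quick way to find a better configuration is to sweep block sizes and time each launch with CUDA events. A minimal sketch (the kernel name `myKernel`, its arguments and the problem size `N` are placeholders, not anything from the code above):

```
// Hypothetical block-size sweep timed with CUDA events.
float bestMs = 1e30f;
int bestBlock = 0;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
for (int block = 64; block <= 1024; block *= 2)
{
    int grid = (N + block - 1) / block;
    cudaEventRecord(start);
    myKernel<<<grid, block>>>(d_data, N);   // placeholder kernel
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    if (ms < bestMs) { bestMs = ms; bestBlock = block; }
}
cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Run each configuration a few times and keep the best, since the first launch includes one-off overheads.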

You’re not alone - some CUBLAS functions have dipped in performance as well. As seibert said, you will most likely need more threads running concurrently. To add to this, if you have inter-thread communication through shared memory, experiment with the __shfl() intrinsics; exchanging values within a warp directly can reduce shared memory usage by a factor of 32.
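As a sketch of what that looks like - assuming compute capability 3.0 or later (the Titan qualifies), a warp-wide sum that would normally bounce through a 32-entry shared-memory array can stay entirely in registers:

```
// Warp-level sum using the shuffle intrinsic (SM 3.0+).
// After the loop, lane 0 of the warp holds the warp's total;
// no shared memory or __syncthreads() is needed within the warp.
__device__ float warpReduceSum(float val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down(val, offset);
    return val;
}
```

(`__shfl_down` is the intrinsic name in the CUDA 5.x toolkits contemporary with GK110.)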

Seems the thread/block sizes made a huge difference to some of my kernels (but not all), and now I have hit another strange issue.

One of my kernels seems to vary in both FLOP count and execution time (104us down to 87us). I had assumed the time difference was related to card warm-up (a clock-speed state change or something), as it starts off at around 104us and switches to around 87us after the first thirty or so calls. The confusing thing is that no other kernel shows this speed shift.
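One way to test the warm-up theory is to log the SM clock around the kernel calls. A sketch using NVML (which ships with the driver package; this is my own hypothetical helper, not part of the program above):

```
// Hypothetical helper: report the current SM clock via NVML,
// to see whether GPU Boost is changing clock states mid-run.
#include <nvml.h>
#include <cstdio>

void logSmClock()
{
    nvmlDevice_t dev;
    unsigned int mhz;
    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);
    nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &mhz);
    printf("SM clock: %u MHz\n", mhz);
    nvmlShutdown();
}
```

If the reported clock differs between the slow early calls and the fast later ones, warm-up is the likely culprit; if it is constant, something else is going on.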

Looking at what Nsight says, the kernel varies in FLOP/DFLOP count (which is itself strange, unless __sincosf does not have a static FLOP count) and in issued (and executed) IPC. When the kernel executes with the lower FLOP count it takes the ~104us, yet when it runs with an extra 150k FLOPs it runs in ~87us.

template<int blockCount, int fastSinCos>
		__global__ void SubPixelShiftKernel512(float shiftX, float shiftY, cuComplex const* d_src, cuComplex* d_dest)
		{
			const int     tid = IMUL(512, blockIdx.x) + threadIdx.x;
			const int threadN = IMUL(blockDim.x, gridDim.x); // note: currently unused

			float twoPiShiftY = TWOPI * shiftY;
			
			//float subj = float(blockIdx.x) / 512.0f; //This will always be below the 0.5f threshold, unless we have 256 blocks
			float subj = __fdividef(float(blockIdx.x), 512.0f);
			// The preprocessor cannot see template parameters, so an
			// "#if blockCount >= 256 / #error" guard never fires. A C++11
			// compile-time assertion does what was intended:
			static_assert(blockCount < 256, "Too high a block count provided");
			float exp_0 = float(tid & 0x00ff) / 512.0f;
			float exp_1 = exp_0 - 0.5f;
			float exp_0b;
			float exp_1b;

			
			exp_0 *=  shiftX;
			exp_1 *= shiftX;
			exp_0b = TWOPI * exp_0;
			exp_1b = TWOPI * exp_1;

			exp_0 = (subj * twoPiShiftY) + exp_0b; 
			exp_1 = (subj * twoPiShiftY) + exp_1b;

			for (int i = tid; i < (512*512); i+= (blockCount * 512))
			{

				cuComplex src = d_src[i];

				src.x /= (512.0f*512.0f);
				src.y /= (512.0f*512.0f);
				
				cuComplex shift_exp;

				if (fastSinCos == 1)
				{
					__sincosf(exp_0,&shift_exp.x, &shift_exp.y);
				}
				else
				{
					shift_exp.x = cosf(exp_0);
					shift_exp.y = sinf(exp_0);
				}

				shift_exp.y = -shift_exp.y; // conjugate: exp(-i*theta)

				float ac = shift_exp.x * src.x; 
				float bd = shift_exp.y * src.y; 
				float abcd = (shift_exp.x + shift_exp.y) * (src.x + src.y);
				shift_exp.x = ac - bd;
				shift_exp.y = abcd - ac - bd;

				d_dest[i] = shift_exp;


				src = d_src[i+256];

				src.x /= (512.0f*512.0f);
				src.y /= (512.0f*512.0f);

				if (fastSinCos == 1)
				{
					__sincosf(exp_1,&shift_exp.x, &shift_exp.y);
				}
				else
				{
					shift_exp.x = cosf(exp_1);
					shift_exp.y = sinf(exp_1);
				}

				shift_exp.y = -shift_exp.y; // conjugate: exp(-i*theta)

				ac = shift_exp.x * src.x;
				bd = shift_exp.y * src.y;
				abcd = (shift_exp.x + shift_exp.y) * (src.x + src.y);
				shift_exp.x = ac - bd;
				shift_exp.y = abcd - ac - bd;

				d_dest[i+256] = shift_exp;


				subj += (float(blockCount) / 512.0f);
				if (subj >= 0.5f)
				{
					subj -= 1.0f;
				}
				exp_0 = (subj * twoPiShiftY) + exp_0b; 
				exp_1 = (subj * twoPiShiftY) + exp_1b; 
			}
		}
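For reference, my guess at the launch configuration (the values below are assumptions on my part, not taken from the posting): since each thread handles elements i and i+256 of a 512-element span per block, the kernel is presumably launched with 256 threads per block, and the blockCount template parameter must match the grid size:

```
// Assumed launch: 256 threads per block, each block covering a
// 512-element span of the 512x512 array; 'blocks' must equal the
// blockCount template argument and satisfy blockCount < 256.
const int blocks = 128;
SubPixelShiftKernel512<blocks, 1><<<blocks, 256>>>(shiftX, shiftY, d_src, d_dest);
```

Worth noting that fastSinCos is a compile-time switch, so both the __sincosf and the sinf/cosf variants get their own fully specialised kernel - handy for A/B-ing exactly the FLOP-count question above.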

I even tried removing the sin/cos to rule out the possibility of a non-deterministic calculation, yet that did not eliminate the strange variation.

Anyone have any ideas?