Increasing register usage without decreasing occupancy drops speed dramatically

I am moving work with neural networks to GPU, for now written simple proof of concept kernel for NN with one hidden layer (256-128-5 neurons per layer)

__device__ float sigmoid_symmetric(float sum)

{

	return (2.0f/(1.0f + __expf(-2.0f * sum)) - 1.0f);

}

template <int totalNeurons>

__global__ void 

//__launch_bounds__(128, 8)

nn_test(float *weights, float *inputs, float *outputs, int inputPitch, int outputPitch)

{

	const int num_layers = 3;

	const int layer_size[num_layers] = {256, 128, 5};

	const int layer_index[num_layers+1] = {0, 256, 256 + 128, 256 + 128 + 5};

	//const float steepness = 1;

	const float max_sum = 150;///steepness;

	//const float max_sum_neg = -150/steepness;

	__shared__ float val [totalNeurons];

unsigned int tid = threadIdx.x;

	float *locInputs = inputs + blockIdx.x * inputPitch;

	float *locOutputs = outputs + blockIdx.x * outputPitch;

	//input layer

	for(int i = tid; i < layer_size[0]; i += blockDim.x)

	{

		//last neuron is the bias

		val[i] = (i != layer_size[0]-1) ? locInputs[i] : 1;

	}

	

	//hidden layer(s)

	float *weightsIt = weights;

	for(int layer = 1; layer < num_layers; layer++)

	{

		__syncthreads();

		if(layer > 1)

			weightsIt += layer_size[layer] * layer_size[layer - 1];

		if(tid < layer_size[layer])

		{

			int num_connections = layer_size[layer - 1];

			float *weightsNeuron = weightsIt + tid;

			int neuronIndex = layer_index[layer] + tid;

			int neuronIndexPrev = layer_index[layer-1];

		

			float neuron_sum = 0;

			for(int con = 0; con < num_connections; con ++)

			{

				neuron_sum += weightsNeuron[con * num_connections] * val[neuronIndexPrev + con];

			}

			//neuron_sum *= 1;//steepness;

			//if(neuron_sum > max_sum)

			//	neuron_sum = max_sum;

			//else if(neuron_sum < -max_sum)

			//	neuron_sum = -max_sum;

			neuron_sum = fmaxf(-max_sum, fminf(neuron_sum, max_sum));

			val[neuronIndex] = 0;//sigmoid_symmetric(neuron_sum);

		}

	}

	//output layer

	__syncthreads();

	int outputIndex = layer_index[num_layers - 1];

	if(tid < layer_size[num_layers - 1])

	{

		locOutputs[tid] = val[outputIndex + tid];

	}

}

I managed to handle coalesced global memory read (with 2x speed boost) but faced other issue (GTX460, SM 2.1, CUDA 4.0 RC2)

I run my kernel with 128 threads block, so each MP handles 8 blocks, 1024 threads/MP, 67% occupancy

In code above each thread use 14 registers

If i change neuron_sum = fmaxf(-max_sum, fminf(neuron_sum, max_sum)); to

if(neuron_sum > max_sum)

				neuron_sum = max_sum;

			else if(neuron_sum < -max_sum)

				neuron_sum = -max_sum;

or uncomment sigmoid_symmetric call, register usage jumps up to 22 registers/thread and speed drops 5x times.

I read best practiced guide, boiled my brain completely but still can’t figure why.

With 22 registers/thread it uses 22K registers per MP from 32K, so there are still plenty of free registers. Occupancy doesn’t drop by calculator and profiler confirmed that it stays at 67% level.

Even if I change code to simple stupid

float a = neuron_sum;

			val[neuronIndex] = a;//sigmoid_symmetric(neuron_sum);

register usage jumps from 14 to 22 and performance drops dramatically. And this little piece executed only once by each tread (5 threads execute it twice), so I believe problem lies somewhere else.

So what could be a problem here? Could you explain why it goes this way?

The result in [font=“Courier New”]neuron_sum[/font] isn’t actually used anywhere in your code. So the compiler optimizes away the whole calculation, which saves both registers and execution time.

If you save the result in[font=“Courier New”] val[/font], the compiler cannot optimize away the calculation anymore. With the more complicated [font=“Courier New”]if … else if …[/font] conditional the compiler still could, but probably isn’t able to deduce this is a safe optimization.

In short: If you want the results from the calculation, you will have to wait.

It seems so
I forgot that it does optimizations even in debug mode :dry:
Shame on me :)

Anyway I got the picture.
Now it’s faster than 1 CPU core (i5 2500) when I run 16+ blocks. With more blocks the difference could be up to 4x and kernel is compute bound.
Probably I can double number of threads to decrease block processing time.

But learning will be a different story. My task prohibit using batch NN training so only good old incremental back propagation, and this part it seems will stick to CPU