I am moving work with neural networks to GPU, for now written simple proof of concept kernel for NN with one hidden layer (256-128-5 neurons per layer)

```
__device__ float sigmoid_symmetric(float sum)
{
return (2.0f/(1.0f + __expf(-2.0f * sum)) - 1.0f);
}
template <int totalNeurons>
__global__ void
//__launch_bounds__(128, 8)
nn_test(float *weights, float *inputs, float *outputs, int inputPitch, int outputPitch)
{
const int num_layers = 3;
const int layer_size[num_layers] = {256, 128, 5};
const int layer_index[num_layers+1] = {0, 256, 256 + 128, 256 + 128 + 5};
//const float steepness = 1;
const float max_sum = 150;///steepness;
//const float max_sum_neg = -150/steepness;
__shared__ float val [totalNeurons];
unsigned int tid = threadIdx.x;
float *locInputs = inputs + blockIdx.x * inputPitch;
float *locOutputs = outputs + blockIdx.x * outputPitch;
//input layer
for(int i = tid; i < layer_size[0]; i += blockDim.x)
{
//last neuron is the bias
val[i] = (i != layer_size[0]-1) ? locInputs[i] : 1;
}
//hidden layer(s)
float *weightsIt = weights;
for(int layer = 1; layer < num_layers; layer++)
{
__syncthreads();
if(layer > 1)
weightsIt += layer_size[layer] * layer_size[layer - 1];
if(tid < layer_size[layer])
{
int num_connections = layer_size[layer - 1];
float *weightsNeuron = weightsIt + tid;
int neuronIndex = layer_index[layer] + tid;
int neuronIndexPrev = layer_index[layer-1];
float neuron_sum = 0;
for(int con = 0; con < num_connections; con ++)
{
neuron_sum += weightsNeuron[con * num_connections] * val[neuronIndexPrev + con];
}
//neuron_sum *= 1;//steepness;
//if(neuron_sum > max_sum)
// neuron_sum = max_sum;
//else if(neuron_sum < -max_sum)
// neuron_sum = -max_sum;
neuron_sum = fmaxf(-max_sum, fminf(neuron_sum, max_sum));
val[neuronIndex] = 0;//sigmoid_symmetric(neuron_sum);
}
}
//output layer
__syncthreads();
int outputIndex = layer_index[num_layers - 1];
if(tid < layer_size[num_layers - 1])
{
locOutputs[tid] = val[outputIndex + tid];
}
}
```

I managed to handle coalesced global memory read (with 2x speed boost) but faced other issue (GTX460, SM 2.1, CUDA 4.0 RC2)

I run my kernel with 128 threads block, so each MP handles 8 blocks, 1024 threads/MP, 67% occupancy

In code above each thread use 14 registers

If i change neuron_sum = fmaxf(-max_sum, fminf(neuron_sum, max_sum)); to

```
if(neuron_sum > max_sum)
neuron_sum = max_sum;
else if(neuron_sum < -max_sum)
neuron_sum = -max_sum;
```

or uncomment sigmoid_symmetric call, register usage jumps up to 22 registers/thread and speed drops 5x times.

I read best practiced guide, boiled my brain completely but still can’t figure why.

With 22 registers/thread it uses 22K registers per MP from 32K, so there are still plenty of free registers. Occupancy doesn’t drop by calculator and profiler confirmed that it stays at 67% level.

Even if I change code to simple stupid

```
float a = neuron_sum;
val[neuronIndex] = a;//sigmoid_symmetric(neuron_sum);
```

register usage jumps from 14 to 22 and performance drops dramatically. And this little piece executed only once by each tread (5 threads execute it twice), so I believe problem lies somewhere else.

So what could be a problem here? Could you explain why it goes this way?