Reduction kernel for neural networks

Hi, I’m developing a kernel that calculates the state of a layer of neurons in a neural network. I want to keep the network as flexible as possible, so a layer can take its inputs from several different layers. Here’s the struct I use:

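typedef enum {LINEAR, ANOTHER_FUNCTION} FunctionType;

typedef struct {
	int numberInputs;        //number of input layers
	int* inputSize;          //neurons in each input layer
	int totalInputSize;      //sum of all inputSize[i]
	float** inputNeurons;    //one array of neurons per input layer
	int outputSize;
	float* outputNeurons;
	float* weighs;           //outputSize rows of totalInputSize weights
	FunctionType functionType;
} Layer;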

It supports different activation functions, but I don’t want to focus on that for now.

I’m just getting started, but I have already written this kernel:

__device__ float func(float number, FunctionType functionType) {
	switch (functionType) {
		//TODO add different functions
		//case ANOTHER_FUNCTION:
		//	return anotherFunction(number);
		case LINEAR:
		default:
			return number;
	}
}

__global__ void LayerCalcKernel(Layer layer)
{
	//one thread per output neuron: threadIdx.x selects the output
	float result = 0;
	int alreadyReaded = 0;	//offset of the current input layer inside a weight row
	for (int i=0; i < layer.numberInputs; ++i){
		for (int j=0; j < layer.inputSize[i]; ++j){
			result += layer.inputNeurons[i][j] //I didn't compile it yet, so I don't know if this is allowed
				* layer.weighs[threadIdx.x*layer.totalInputSize + alreadyReaded + j];
		}
		alreadyReaded += layer.inputSize[i];
	}
	layer.outputNeurons[threadIdx.x] = func(result, layer.functionType);
}

This kernel doesn’t use blocks yet; it only indexes by threadIdx.x. I was trying to think of a way to use shared memory, but I need all the inputs to compute each output neuron.
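To make the weight layout concrete, here is a minimal host-side sketch in plain C of what each output neuron should compute. It could serve as a reference to compare kernel results against. The function name layerCalcReference and its flat parameter list are assumptions made for this example (they are not part of the Layer struct), and only the LINEAR activation is covered.

```c
#include <assert.h>

/* Host-side reference for one layer. Weights are assumed to be stored
   row by row: one row of totalInputSize weights per output neuron.
   layerCalcReference is a hypothetical helper name for this sketch. */
void layerCalcReference(int numberInputs, const int* inputSize,
                        float* const* inputNeurons,
                        int outputSize, const float* weighs,
                        float* outputNeurons)
{
	assert(numberInputs >= 0 && outputSize >= 0);

	/* total number of inputs feeding each output neuron */
	int totalInputSize = 0;
	for (int i = 0; i < numberInputs; ++i) {
		totalInputSize += inputSize[i];
	}

	for (int o = 0; o < outputSize; ++o) {
		float result = 0.0f;
		int alreadyRead = 0; /* offset of input layer i inside the weight row */
		for (int i = 0; i < numberInputs; ++i) {
			for (int j = 0; j < inputSize[i]; ++j) {
				result += inputNeurons[i][j]
					* weighs[o * totalInputSize + alreadyRead + j];
			}
			alreadyRead += inputSize[i];
		}
		outputNeurons[o] = result; /* LINEAR activation: identity */
	}
}
```

The index o * totalInputSize + alreadyRead + j is the same one the kernel uses with threadIdx.x in place of o.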

Then I saw a PDF explaining how to optimize reduction kernels, and I thought it could be just what I needed.

It can be found here.

After reading it, I realized that I could write another kernel with one thread for each input neuron, instead of one thread for each output neuron as I did before.

If I do that, I must launch the kernel once for each output neuron in each layer.

I’m thinking about how to combine both approaches, but I don’t really know how to do it.

Any suggestions would be great.

I have another question.

I want to write another version of the kernel for binary neurons (they can only take the values 0 or 1). In this case I would like to use single bits instead of floats for the input and output neurons. I have done this before with XMM assembly instructions and masks and it worked fine, but there the weights were bytes instead of floats.

Is there a way to optimize the access to single bits?

In the previous code I could change this:

result += layer.inputNeurons[i][j] * layer.weighs[threadIdx.x*layer.totalInputSize + alreadyReaded + j];

to this:

if (layer.inputNeurons[i].getBit(j)) {
	result += layer.weighs[threadIdx.x*layer.totalInputSize + alreadyReaded + j];
}

getBit(j) would be a method of a class, and C++ is not allowed here, but I’m just trying to illustrate the algorithm.
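One way getBit could work without C++ is as a plain function over neurons packed 32 per word. The sketch below is only an illustration under that assumption: the names getBit, setBit and binaryWeightedSum and the choice of uint32_t words are hypothetical, and inside a kernel the functions would additionally need the __device__ qualifier.

```c
#include <assert.h>
#include <stdint.h>

/* Returns bit j (0 or 1) of a neuron array packed 32 neurons per word. */
uint32_t getBit(const uint32_t* bits, uint32_t j)
{
	return (bits[j / 32u] >> (j % 32u)) & 1u;
}

/* Sets bit j to 1 if value is non-zero, otherwise clears it. */
void setBit(uint32_t* bits, uint32_t j, uint32_t value)
{
	uint32_t mask = 1u << (j % 32u);
	if (value) {
		bits[j / 32u] |= mask;
	} else {
		bits[j / 32u] &= ~mask;
	}
}

/* Weighted sum for one output neuron over n packed binary inputs:
   a weight is added only where the corresponding input bit is 1. */
float binaryWeightedSum(const uint32_t* bits, const float* weights, uint32_t n)
{
	assert(bits != 0 && weights != 0);
	float result = 0.0f;
	for (uint32_t word = 0; word * 32u < n; ++word) {
		uint32_t w = bits[word]; /* one 32-bit load covers 32 inputs */
		for (uint32_t b = 0; b < 32u && word * 32u + b < n; ++b) {
			if ((w >> b) & 1u) {
				result += weights[word * 32u + b];
			}
		}
	}
	return result;
}
```

Loading a whole word once and then testing its bits in a register means 32 inputs cost a single memory read, which is the main appeal of packing the neurons this way.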

I would also need:

if (binaryFunc(result)) {
	//set bit threadIdx.x of the output neurons
}

instead of:

layer.outputNeurons[threadIdx.x] = func(result, layer.functionType);

Any ideas?

I hope I’ve explained it well enough. I’m Spanish, so I’m sorry if my English isn’t correct.

Thanks in advance.

Hi, after reading a lot and thinking about it, this is my new code:

typedef struct {
	unsigned numberInputLayers;
	unsigned* inputLayerSize;
	unsigned totalInputSize;
	float** inputNeurons;
	unsigned outputSize;
	float* outputNeurons;
	float* weighs;
	FunctionType functionType;
} Layer;

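__global__ void LayerCalcKernel(Layer layer)
{
	//one thread per connection; this version still runs in a single block
	unsigned connection = threadIdx.x;
	unsigned outputNeuron = connection / layer.totalInputSize;
	unsigned inputNeuron = connection % layer.totalInputSize;

	//find the input layer this connection reads from
	unsigned inputLayer = 0;
	char inputLayerFounded = 0;
	while (!inputLayerFounded){
		if (inputNeuron < layer.inputLayerSize[inputLayer]) {
			inputLayerFounded = 1;
		}
		else {
			inputNeuron -= layer.inputLayerSize[inputLayer++];
		}
	}

	//a __shared__ array cannot be sized at run time, so it is declared extern
	//and its size in bytes is passed as the third kernel launch parameter
	extern __shared__ float processingArray[];
	processingArray[connection] = layer.inputNeurons[inputLayer][inputNeuron] * layer.weighs[connection];
	__syncthreads();

	//tree reduction in shared memory
	for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
		if (connection < s) {
			processingArray[connection] += processingArray[connection + s];
		}
		__syncthreads();
	}

	layer.outputNeurons[outputNeuron] = func(processingArray[0], layer.functionType);
}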

I assume that the number of connections must be a power of 2, because the reduction loop halves blockDim.x at every step.
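A possible way around the power-of-2 restriction is to round the size up to the next power of two and pad the extra slots with zeros, which do not change the sum. The host-side C sketch below simulates the same halving loop on a padded buffer. The names nextPow2 and reducePadded are hypothetical, and the caller is assumed to provide a scratch buffer with at least nextPow2(n) elements.

```c
#include <assert.h>

/* Smallest power of two that is >= n, for 1 <= n <= 2^31. */
unsigned nextPow2(unsigned n)
{
	assert(n >= 1u && n <= 0x80000000u);
	unsigned p = 1u;
	while (p < n) {
		p <<= 1;
	}
	return p;
}

/* Sums n values with the halving scheme of the reduction loop.
   scratch must hold at least nextPow2(n) floats; the slots past n
   are filled with zeros so they do not affect the result. */
float reducePadded(const float* values, unsigned n, float* scratch)
{
	unsigned size = nextPow2(n);
	for (unsigned i = 0; i < size; ++i) {
		scratch[i] = (i < n) ? values[i] : 0.0f;
	}
	/* on the GPU each inner iteration would be one thread,
	   followed by a barrier before the next halving step */
	for (unsigned s = size / 2u; s > 0u; s >>= 1) {
		for (unsigned i = 0; i < s; ++i) {
			scratch[i] += scratch[i + s];
		}
	}
	return scratch[0];
}
```

In a kernel the same idea would mean launching nextPow2(n) threads and letting the threads whose index is n or larger store 0 instead of a product.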

But what if there is not enough shared memory to hold processingArray (layer.totalInputSize floats)?

I still don’t use blocks. I think each output neuron could be calculated within one block.

Can a block be of any size?

Maybe the reduction can be optimized.

Any comments would be appreciated. Anyway, I’ll post back.