Reduction kernel for neural networks

Hi, I’m developing a kernel that computes the state of one layer of neurons in a neural network. I want to keep the network as flexible as possible, so a layer can take its inputs from several different layers. Here’s the struct I use:

typedef enum {LINEAR, ANOTHER_FUNCTION} FunctionType;

typedef struct {
	int numberInputs;
	int* inputSize;
	int totalInputSize;
	float** inputNeurons;
	int outputSize;
	float* outputNeurons;
	float* weighs;
	FunctionType functionType;
} Layer;

It supports different activation functions, but I don’t want to focus on that for now.

I’m just getting started, but I have already written this kernel:

__device__ float func(float number, FunctionType functionType) {
	switch (functionType) {
		//TODO add different functions
		//case ANOTHER_FUNCTION:
		//	return anotherFunction(number);
		case LINEAR:
		default:
			return number;
	}
}

__global__ void LayerCalcKernel(Layer layer)
{
	float result = 0;
	int alreadyReaded = 0;
	for (int i = 0; i < layer.numberInputs; ++i) {
		for (int j = 0; j < layer.inputSize[i]; ++j) {
			result += layer.inputNeurons[i][j] // I didn't compile it yet, so I don't know if this is allowed
				* layer.weighs[threadIdx.x * layer.totalInputSize + alreadyReaded + j];
		}
		alreadyReaded += layer.inputSize[i];
	}
	layer.outputNeurons[threadIdx.x] = func(result, layer.functionType);
}

In this kernel I don’t control the blocks. I was trying to think of a way to use shared memory, but I need all the inputs to compute one output neuron.

Then I saw a .pdf explaining how to optimize reduction kernels, and I thought it could be just what I needed.

It can be found here.

After reading it, I realized I could write another kernel with one thread per input neuron instead of one thread per output neuron, as I did before.

If I do that, I must call the kernel once for each output neuron in each layer.

I’m thinking about how to mix both approaches, but I don’t really know how to do it.

Any suggestion would be great.

I have another question.

I want to write another version of the kernel for binary neurons (they can only take the values 0 or 1). In this case I would like to use single bits instead of floats for the input and output neurons. I have done this before with XMM assembly instructions and masks and it worked fine, but there the weighs were bytes instead of floats.

Is there a way to optimize the access to single bits?

In the previous code I could change this:

result += layer.inputNeurons[i][j] * layer.weighs[threadIdx.x * layer.totalInputSize + alreadyReaded + j];

for this:

if (layer.inputNeurons[i].getBit(j)) {
	result += layer.weighs[threadIdx.x * layer.totalInputSize + alreadyReaded + j];
}

getBit(j) would be a method of a class, and I know C++ is not allowed here; I just write it this way to clarify the algorithm.

I would also need:

if (binaryFunc(result))
	layer.outputNeurons[threadIdx.x].setBit();
else
	layer.outputNeurons[threadIdx.x].resetBit();

instead of:

layer.outputNeurons[threadIdx.x] = func(result, layer.functionType);

Any ideas?

I hope I’ve explained it well enough. I’m Spanish, so I’m sorry if my English isn’t correct.

Thanks in advance.

Hi, after reading a lot and thinking about it, this is my new code:

typedef struct {
	unsigned numberInputLayers;
	unsigned* inputLayerSize;
	unsigned totalInputSize;
	float** inputNeurons;
	unsigned outputSize;
	float* outputNeurons;
	float* weighs;
	FunctionType functionType;
} Layer;

__global__ void LayerCalcKernel(Layer layer)
{
	unsigned connection = threadIdx.x;
	unsigned outputNeuron = connection / layer.totalInputSize;
	unsigned inputNeuron = connection % layer.totalInputSize;

	unsigned inputLayer = 0;
	char inputLayerFound = 0;
	while (!inputLayerFound) {
		if (inputNeuron < layer.inputLayerSize[inputLayer]) {
			++inputLayerFound;
		} else {
			inputNeuron -= layer.inputLayerSize[inputLayer++];
		}
	}

	// The size is not known at compile time, so the array must be declared
	// extern and its size passed as the third parameter of the kernel launch.
	extern __shared__ float processingArray[];
	processingArray[connection] = layer.inputNeurons[inputLayer][inputNeuron] * layer.weighs[connection];
	__syncthreads();

	for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
		if (connection < s) {
			processingArray[connection] += processingArray[connection + s];
		}
		__syncthreads();
	}

	if (connection == 0) {
		layer.outputNeurons[outputNeuron] = func(processingArray[0], layer.functionType);
	}
}

I assume that the number of connections must be a power of 2.

But what if the shared memory is not enough for:

__shared__ float processingArray[layer.totalInputSize];

I still don’t use blocks. I think each output neuron could be calculated within a block.

Can a block be of any size?

Maybe the reduction can be optimized further.

Any comment would be appreciated. Anyway, I’ll post back.