Hi, I'm developing a kernel that calculates the state of a layer of neurons in a neural network. I want to keep the network as flexible as possible, so a layer can take its inputs from several different layers. Here's the struct I use:
typedef enum {LINEAR, ANOTHER_FUNCTION} FunctionType;

typedef struct {
    int numberInputs;       // how many source layers feed this layer
    int* inputSize;         // size of each source layer
    int totalInputSize;     // sum of inputSize[0..numberInputs-1]
    float** inputNeurons;   // one pointer per source layer
    int outputSize;
    float* outputNeurons;
    float* weighs;          // outputSize * totalInputSize weights
    FunctionType functionType;
} Layer;
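For reference, this is roughly how I set up a Layer on the host before launching (an untested sketch, and `makeLayer` is just a name I made up for this post): since `inputNeurons` is a `float**`, I build an array of device pointers on the host and copy it over.

```cuda
#include <cuda_runtime.h>

// Untested sketch of the host-side setup for one Layer.
// srcDevPtrs[i] are device pointers to the outputs of the source layers.
Layer makeLayer(float** srcDevPtrs, int* hostInputSize,
                int numberInputs, int outputSize)
{
    Layer layer;
    layer.numberInputs = numberInputs;
    layer.outputSize = outputSize;

    layer.totalInputSize = 0;
    for (int i = 0; i < numberInputs; ++i)
        layer.totalInputSize += hostInputSize[i];

    cudaMalloc((void**)&layer.inputSize, numberInputs * sizeof(int));
    cudaMemcpy(layer.inputSize, hostInputSize,
               numberInputs * sizeof(int), cudaMemcpyHostToDevice);

    // inputNeurons: a device array that itself holds device pointers.
    cudaMalloc((void**)&layer.inputNeurons, numberInputs * sizeof(float*));
    cudaMemcpy(layer.inputNeurons, srcDevPtrs,
               numberInputs * sizeof(float*), cudaMemcpyHostToDevice);

    cudaMalloc((void**)&layer.outputNeurons, outputSize * sizeof(float));
    cudaMalloc((void**)&layer.weighs,
               outputSize * layer.totalInputSize * sizeof(float));
    layer.functionType = LINEAR;
    return layer;
}
```

Since the kernel takes the Layer by value, the struct itself is copied at launch, but every pointer inside it has to be a device pointer.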
It supports different activation functions, but I don't want to focus on that for now.
I'm just getting started, but I have already written this kernel:
__device__ float func(float number, FunctionType functionType) {
    switch (functionType) {
    // TODO: add the other activation functions, e.g.:
    // case ANOTHER_FUNCTION:
    //     return anotherFunction(number);
    case LINEAR:
    default:
        return number;
    }
}
__global__ void LayerCalcKernel(Layer layer)
{
    // One thread per output neuron.
    float result = 0;
    int alreadyReaded = 0;  // offset of input layer i inside this thread's weight row
    for (int i = 0; i < layer.numberInputs; ++i) {
        for (int j = 0; j < layer.inputSize[i]; ++j) {
            result += layer.inputNeurons[i][j]
                * layer.weighs[threadIdx.x * layer.totalInputSize + alreadyReaded + j];
        }
        alreadyReaded += layer.inputSize[i];
    }
    layer.outputNeurons[threadIdx.x] = func(result, layer.functionType);
}
In this kernel I don't use the block index at all. I was trying to imagine a way to use shared memory, but I need all the inputs to compute one output neuron.
Then I saw a .pdf explaining how to optimize reduction kernels, and I thought it could be just what I needed.
It can be found here.
After reading it, I realized I could write another kernel with one thread per input neuron instead of one thread per output neuron, as I did before.
But if I do that, I have to call the kernel once for each output neuron in each layer.
I'm thinking about how to mix both approaches, but I don't really know how to do it.
Any suggestion would be great.
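To show what I mean, this is the kind of mix I have in mind (an untested sketch): one block per output neuron, with blockDim.x threads that each accumulate part of the sum, followed by a shared-memory tree reduction like in the .pdf. I assume here that the inputs have already been flattened into a single device array `flatInputs` of length `totalInputSize` (a name I made up), and that blockDim.x is a power of two; `func` and `FunctionType` are the ones from above.

```cuda
// Untested sketch: grid of outputSize blocks, one block per output neuron.
// Launch with <<<outputSize, threads, threads * sizeof(float)>>>.
__global__ void LayerCalcReduceKernel(const float* flatInputs,
                                      const float* weighs,
                                      float* outputNeurons,
                                      int totalInputSize,
                                      FunctionType functionType)
{
    extern __shared__ float partial[];  // blockDim.x floats
    int out = blockIdx.x;               // this block's output neuron

    // Each thread sums a strided slice of the inputs.
    float sum = 0.0f;
    for (int j = threadIdx.x; j < totalInputSize; j += blockDim.x)
        sum += flatInputs[j] * weighs[out * totalInputSize + j];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x must be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        outputNeurons[out] = func(partial[0], functionType);
}
```

Maybe something along these lines could work, but I don't know if launching one block per output neuron is a good idea when the layers are small.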
I have another question.
I want to write another version of the kernel for binary neurons (they can only take the values 0 or 1). In this case I would like to use single bits instead of floats for the input and output neurons. I have done this on the CPU with XMM assembly instructions and masks and it worked fine, but there the weights were bytes instead of floats.
Is there a way to optimize the access to single bits?
In the previous code I would change this:
result += layer.inputNeurons[i][j] * layer.weighs[threadIdx.x * layer.totalInputSize + alreadyReaded + j];
for this:
if (layer.inputNeurons[i].getBit(j)) {
    result += layer.weighs[threadIdx.x * layer.totalInputSize + alreadyReaded + j];
}
getBit(j) would be a method of a class, and C++ is not allowed here, but I use it just to clarify the algorithm.
I would also need:
if (binaryFunc(result))
layer.outputNeurons[threadIdx.x].setBit();
else
layer.outputNeurons[threadIdx.x].resetBit();
instead of:
layer.outputNeurons[threadIdx.x] = func(result);
Any ideas?
I hope I've explained it well enough. I'm Spanish, so I'm sorry if my English isn't correct.
Thanks in advance.