Hello all, I am trying to optimize a recurrent neural network algorithm and could use some help…
[codebox]__global__ void
gatedRecurrentHiddenLayerKernel(double* layerOutput, double* nextLayerInput, unsigned int seriesLength, unsigned int dataPoints, bool calcError) {
    // Offset of this thread's series within its column
    const unsigned int tid_Series = threadIdx.x*seriesLength;
    // Offset of this block's column (three planes of dataPoints values per column)
    const unsigned int tid_col = blockIdx.x*dataPoints*3;
    unsigned int out_tid = blockIdx.x*dataPoints + tid_Series;
    unsigned int tid_1 = 0;
    unsigned int tid_2 = 0;
    unsigned int tid_3 = 0;
    double forgetGate = 0.0;
    double forgetGatePreAct = 0.0;
    double cellInput = 0.0;
    double outputGate = 0.0;
    double outputGatePreAct = 0.0;
    double preOuputGate = 0.0;
    double output = 0.0;
    // Each step depends on the previous one, so the series is walked sequentially
    for(unsigned int i=0; i<seriesLength; i++) {
        tid_1 = tid_col + tid_Series + i; // cell input
        tid_2 = tid_1 + dataPoints; // forget gate
        tid_3 = tid_1 + 2*dataPoints; // output gate
        // Gate pre-activations accumulate over time before the squashing function
        forgetGatePreAct = layerOutput[tid_2]+forgetGatePreAct;
        forgetGate = forgetGatePreAct;
        logSIGMOfPointerKernel(&forgetGate);
        layerOutput[tid_2] = forgetGate;
        outputGatePreAct = layerOutput[tid_3]+outputGatePreAct;
        outputGate = outputGatePreAct;
        logSIGMOfPointerKernel(&outputGate);
        layerOutput[tid_3] = outputGate;
        cellInput = tanh(layerOutput[tid_1]);
        layerOutput[tid_1] = cellInput;
        // Cell state: new input plus the gated previous state
        preOuputGate = cellInput + forgetGate*preOuputGate;
        output = tanh(preOuputGate)*outputGate;
        nextLayerInput[out_tid+i] = output;
    }
}[/codebox]
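For reference, here is one iteration of that loop pulled out as a standalone function, which makes the step-to-step dependency easier to see. This is only a sketch: `logisticSigmoid` stands in for `logSIGMOfPointerKernel`, whose definition is not posted, and I am assuming it applies the logistic function 1/(1+exp(-x)) in place. `GateState` and `gatedStep` are hypothetical names used only for illustration.

```cuda
#include <cmath>

// ASSUMPTION: logSIGMOfPointerKernel is the logistic sigmoid applied in place.
__host__ __device__ inline double logisticSigmoid(double x) {
    return 1.0 / (1.0 + exp(-x));
}

// Hypothetical container for the values carried from step t-1 to step t.
struct GateState {
    double forgetGatePreAct; // running sum of forget-gate inputs
    double outputGatePreAct; // running sum of output-gate inputs
    double cell;             // cell state (preOuputGate in the kernel)
};

// One time step of the recurrence; returns the layer output for this step.
__host__ __device__ inline double gatedStep(GateState& s, double cellIn,
                                            double forgetIn, double outputIn) {
    s.forgetGatePreAct += forgetIn;
    s.outputGatePreAct += outputIn;
    const double forgetGate = logisticSigmoid(s.forgetGatePreAct);
    const double outputGate = logisticSigmoid(s.outputGatePreAct);
    // The new cell state needs the previous one, so steps cannot be reordered.
    s.cell = tanh(cellIn) + forgetGate * s.cell;
    return tanh(s.cell) * outputGate;
}
```

Every value in `GateState` feeds the next call, which is why a single series cannot simply be spread across threads as written.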
Since y(t) depends on x_t as well as x_{t-1}, every series has to run sequentially (~200 data points make up a series). There are 192 “series” in a column of data, and ~20-40 columns of data. I’ve therefore been setting the number of threads to 192 and the number of blocks to the number of columns. This leaves a single thread to process everything within the for-loop: the kernel is called with ~30 blocks, each block has 192 threads, and each thread has more than its fair share of work…
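To make the launch configuration concrete, this is roughly how the grid and block sizes work out. It is a sketch, not my actual host code: `LaunchDims`, `gatedLayerLaunchDims`, and `totalThreads` are hypothetical helpers, and the numbers 30 and 192 are just the typical values quoted above.

```cuda
#include <cuda_runtime.h>

// Hypothetical pair of launch dimensions for the kernel.
struct LaunchDims {
    dim3 grid;  // one block per column of data
    dim3 block; // one thread per series in a column
};

// Hypothetical helper: map columns to blocks and series to threads.
inline LaunchDims gatedLayerLaunchDims(unsigned int columns,
                                       unsigned int seriesPerColumn) {
    LaunchDims d;
    d.grid = dim3(columns);
    d.block = dim3(seriesPerColumn);
    return d;
}

// Total number of threads in the launch (1-D grid and block assumed).
inline unsigned int totalThreads(const LaunchDims& d) {
    return d.grid.x * d.block.x;
}

// Intended use (the device pointers and sizes come from the caller):
//   LaunchDims d = gatedLayerLaunchDims(30, 192);
//   gatedRecurrentHiddenLayerKernel<<<d.grid, d.block>>>(
//       layerOutput, nextLayerInput, seriesLength, dataPoints, calcError);
```

With 30 columns this gives 30 × 192 = 5760 threads in total, each of which then walks its ~200-step series on its own.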
As expected, the execution times are pretty bad: about 50% of the total program execution time is spent inside this kernel. I’m looking into ways to approximate the time series by dividing it into smaller series, but I’m not yet sure how to do that without introducing an unacceptable amount of error.
So if anyone has any suggestions based on the code posted above, they would be appreciated.
Thanks,
Tim