I’m making a cuda version of a type of recurrent neural network and I’ve run into a parallelization problem. In essence the system looks like this:
[Feedback Memory]->[MLP Neural Net]
For those unfamiliar with the subject, implementing a standard neural net is very simple: it’s just a bunch of matrix multiplications with a non-linear function applied here and there. It’s quite parallelizable and no problem for CUDA, you just use all the data directly and you get two orders of magnitude faster results than on a CPU. As the neural net is static, there are no time dependencies in the data. Sample n+1 is decoupled from sample n-1.
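To make the contrast concrete, here is a minimal sketch (plain C++ rather than CUDA, and the function and variable names are my own) of why the static net parallelizes so well: every (sample, unit) output is independent of every other, so a GPU can assign one thread per output.

```cpp
#include <cmath>
#include <vector>

// Minimal sketch of a one-layer forward pass. Every (sample, unit)
// output is independent of every other, so on a GPU each one could
// be computed by its own thread.
std::vector<std::vector<float>> forward(
    const std::vector<std::vector<float>>& X,  // samples x features
    const std::vector<std::vector<float>>& W)  // features x hidden units
{
    std::size_t n = X.size(), d = W.size(), h = W[0].size();
    std::vector<std::vector<float>> Y(n, std::vector<float>(h, 0.0f));
    for (std::size_t i = 0; i < n; ++i)        // parallel over samples
        for (std::size_t j = 0; j < h; ++j) {  // parallel over units
            float s = 0.0f;
            for (std::size_t k = 0; k < d; ++k)
                s += X[i][k] * W[k][j];
            Y[i][j] = std::tanh(s);            // the non-linearity
        }
    return Y;
}
```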
The problem is in the feedback memory. With feedback memories, sample n+1 is a function of sample n. A typical data set has perhaps about a hundred features (columns) and tens of thousands of samples (rows). If you are forced to feed the GPU one sample at a time, you can at most get ~100 threads working simultaneously (equal to the number of features in the data), which is truly a misuse of the GPU and the performance will be comparable to CPU performance.
To give another, similar problem: suppose you have a few differential equations to solve, but need to do it over a large number of time steps.
This seems to me to be a very fundamental problem that either 1) has no solution (i.e. the problem is completely serial in nature and can't be parallelized) or 2) has a standard solution or approach.
So my question is whether anybody has dealt with a similar problem, or has any ideas on how to solve it?
What kind of differential equations do you need to solve? I’m fairly sure that people have done both ODE and PDE solvers on CUDA. Whether they will fit into your specific NN algorithm is another issue altogether, but you may be able to find some code if you search the forums here.
If the samples are decoupled, couldn’t you then run multiple identical (in terms of inputs, hidden layers, etc.) neural networks in parallel on different parts of the data (i.e. split your source data into chunks), then combine the nets together at the end?
I mentioned the differential equations just as an example of a similar problem - for my particular problem they are not needed.
As for the samples being decoupled, the problem is that they’re not. As we’re talking about time series data (hence the feedback memories) - each sample depends on past samples. In the illustration above I used a simple recurrent connection which forms an infinite impulse response filter. In that simplified case the influence of a past sample decays exponentially. In reality I use a structure with digital gates so that the memory depth is really unlimited and exact. In short the samples have to be processed in strict order.
The second part (the traditional neural net) is no problem for parallelization, but it is of little use if the feedback memory is the bottleneck.
Many apparently non-parallel problems can be made parallel with the use of the scan algorithm. It is not immediately apparent whether it could be applied in this case, but if you read up on the scan (there is a nice whitepaper in the SDK) and look at it as a big hammer you may find out that your problem is a nail.
Here is a short paper by someone else taking a stab at a GPU-based neural network. Perhaps it will give you some hints, or at least have some useful references to pre-GPU parallel implementations.
GPU Neural Net
Thank you, I’ll look into it. It may be possible to do some variation on it, but I’ll have to read up more and think a bit about it.
Do you work on stored or real-time streaming data? You might consider doing a sort of software pipelining perhaps?
Stored data, at least for now. I did consider a pipeline approach, but the depth of the serial process isn’t sufficient to make it worth implementing. I can at best process a few samples in parallel, but that is two or three orders of magnitude less than what would be needed to speed it up.
There is one aspect to this problem that I haven’t mentioned, and that is that I will be running an optimization algorithm (genetic or similar) on top of it. So in essence I’m not really interested in training one neural net at a time but a large number of them. This makes it possible (at least in theory) to get a fair amount of processing done even if it is sample-by-sample, as this could be done for many systems at once. It is still perhaps an order of magnitude short of the type of speedup you can get from running a standard static neural net (one at a time), but I may have to live with it.
I still have some experimentation and math to do for a modified scan algorithm, but it is likely it won’t work out. The scan algorithm can do this very well: output[i] = output[i-1] + f(input[i-1]), but my problem is (simplified) of the form output[i] = f(output[i-1]) - input[i]. This may seem like a trivial difference, but when you break down the scan algorithm it becomes apparent that it’s really not.