I am interested in doing neural network programming on a GPU, and I was wondering if other people here are engaged in the same sort of activity. If so, are you in the SF Bay Area? It might be interesting to form a group to exchange ideas about this application.
Check the CUDA Zone; I believe you can find a link to a relevant webpage there. It also contains source code, as far as I remember.
Hi, did you manage to gather a group for the NN library, or to do some coding on it? I am interested in building a recurrent NN tool in CUDA for speech recognition. Any tips are appreciated.
Hello :-) I am Paolo from Italy.
I am very interested in spiking neural networks on the GPU, and I’m developing a project on my own using Izhikevich’s neuron model and conductance delays.
Are you interested too?
Hi Paolo and everyone!
I am also carrying out a project using a spiking neural network similar to Izhikevich’s. Until now I have used powerful PCs and cluster supercomputers (with the MPICH package).
Now I am thinking about a Tesla C1060. I am just beginning to study how it works, so much is still unclear to me. For example, my networks currently include
a few hundred neurons (with a few hundred synapses per neuron). For the sake of simplicity, let my network consist of 240 neurons, randomly interconnected. The Tesla has 240 processors.
What would be the optimum computation model? 240 thread blocks with 1 thread per block? Sorry if my question is naive, but I cannot understand the exact relationships between the numerous
software and hardware entities such as thread, thread block, warp, multiprocessor, scalar processor, etc. Various manuals shed some light … but many dark corners remain…
Thanks in advance for all suggestions.
PS Paolo, it would be extremely interesting to learn how you map your network to processors/threads/blocks, what hardware you use, and what simulation speed-up
(compared to an ordinary powerful PC) you reach.
1 thread per block is like shooting yourself in the foot. You need at least 32 or 64 threads per block to make sure that the warp-level (array processor) execution is NOT wasted.
You also need to watch the number of active threads per multiprocessor. It should be at least 192. Thus, if your block has 32 threads, make sure you have 6 active blocks running on each MP.
Also note that blocks cannot communicate or synchronize with one another. Atomic global operations are costly.
So your randomly interconnected neurons can pose challenges on CUDA. Nonetheless, I don’t have practical experience with neural networks at the moment. As someone pointed out, check the CUDA Zone and look at the source code.
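The arithmetic in the post above can be checked with a few lines of host code. This is only a rough sketch: `blocksPerSmNeeded` and `printDeviceSummary` are hypothetical helper names, and the 192-thread target is simply the figure quoted above, not something read from the hardware. `cudaGetDeviceProperties` reports the actual warp size and multiprocessor count of the installed card.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of resident blocks per multiprocessor needed
// to reach a target count of active threads (rounded up).
int blocksPerSmNeeded(int threadsPerBlock, int minActiveThreads)
{
    assert(threadsPerBlock > 0);
    return (minActiveThreads + threadsPerBlock - 1) / threadsPerBlock;
}

// Prints the hardware figures discussed in the thread for device 0.
// Returns false if no CUDA device can be queried.
bool printDeviceSummary()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return false;
    printf("%s: %d multiprocessors, warp size %d, max %d threads per block\n",
           prop.name, prop.multiProcessorCount, prop.warpSize,
           prop.maxThreadsPerBlock);
    return true;
}
```

With 32 threads per block, `blocksPerSmNeeded(32, 192)` gives the 6 blocks mentioned above; with 64 threads per block, 3 blocks would be enough.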
So for a large number of neurons (a multiple of 512 to make it easy), would this give the optimum grid/block settings?
[codebox]int n_blocks = Num_Neurons/512;
dim3 dimBlock(512, 1, 1);
dim3 dimGrid(n_blocks, 1, 1);[/codebox]
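If the neuron count is not an exact multiple of the block size, the integer division above drops the last partial block. A common pattern, sketched here under assumptions, is to round the grid size up and guard the kernel with a bounds check. The names `gridSizeFor`, `updateNeurons` and `runOnce` are hypothetical, and the update rule (adding 1.0f) is a placeholder, not a neuron model.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Round up so that a partial last block is still launched.
int gridSizeFor(int numNeurons, int threadsPerBlock)
{
    assert(threadsPerBlock > 0);
    return (numNeurons + threadsPerBlock - 1) / threadsPerBlock;
}

// One thread per neuron. Placeholder update: adds 1.0f to each state value.
__global__ void updateNeurons(float *state, int numNeurons)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numNeurons)          // threads past the end of the array do nothing
        state[i] += 1.0f;
}

// Host wrapper: runs one update on a zero-initialized state and returns it.
// Assumes numNeurons > 0.
std::vector<float> runOnce(int numNeurons, int threadsPerBlock)
{
    std::vector<float> host(numNeurons, 0.0f);
    size_t bytes = numNeurons * sizeof(float);
    float *dev = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(threadsPerBlock, 1, 1);
    dim3 dimGrid(gridSizeFor(numNeurons, threadsPerBlock), 1, 1);
    updateNeurons<<<dimGrid, dimBlock>>>(dev, numNeurons);

    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

For 1000 neurons and 512 threads per block this launches 2 blocks; the 24 surplus threads in the second block fail the bounds check and simply exit.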
It’s impossible to say; picking the best block size is basically a process of trial and error. You have to benchmark the real app with different sizes.
But first you should consider how you are going to parallelize your problem. One thread per neuron? Maybe one thread per synapse? How about even finer grained: maybe you can find parallelism within each synapse and/or neuron? Try to look for the finest possible decomposition and work your way from there. You will get the most out of CUDA if your threads are very lightweight and you have tons of them. Keep in mind that scheduling tens of thousands of threads is basically free, synchronization isn’t very expensive either (it’s not MPI), and there’s very little overhead from working this way. 240 threads is nothing; you won’t even start to saturate the card.
Indeed, the scheme “neuron=block, synapse=thread” seems wise, because about 90% of the neuron state recalculation time is spent on synapse state recalculation, so a speed-up of up to about 10-fold may be expected from parallelizing the synapses alone.
The main problem is the need to synchronize the state recalculation of all neurons in the network: the output of some neurons has to be transferred to the input of other neurons before the next state recalculation.
Besides, a new portion of the external (receptor) signal has to be injected from the host before each network state recalculation…
It is unclear what this synchronization costs. Could it eat up all the gained speed-up? Could anyone give a rough estimate for the Tesla C1060?
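For illustration, the “neuron=block, synapse=thread” scheme might look roughly like the sketch below. Everything here is an assumption rather than anyone's actual code: the names (`synapticInput`, `computeInputs`, `SYN_THREADS`), the data layout (a fixed number of synapses per neuron, stored row-major) and the simplified synapse model (weight times a 0/1 spike flag). Each block sums the contributions of one neuron's synapses with a shared-memory reduction.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define SYN_THREADS 256   // threads per block; must be a power of two

// One block per neuron; its threads stride over that neuron's synapses.
// weights: numNeurons x synPerNeuron, row-major
// preIdx:  presynaptic neuron index for each synapse (same layout)
// spiked:  1.0f if that neuron fired in the previous step, else 0.0f
__global__ void synapticInput(const float *weights, const int *preIdx,
                              const float *spiked, float *input,
                              int synPerNeuron)
{
    __shared__ float partial[SYN_THREADS];
    int neuron = blockIdx.x;

    float sum = 0.0f;
    for (int s = threadIdx.x; s < synPerNeuron; s += blockDim.x) {
        int k = neuron * synPerNeuron + s;
        sum += weights[k] * spiked[preIdx[k]];
    }
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction of the per-thread partial sums in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        input[neuron] = partial[0];
}

// Host wrapper: copies the network to the device, launches one block per
// neuron, and returns the summed synaptic input of every neuron.
std::vector<float> computeInputs(const std::vector<float> &weights,
                                 const std::vector<int> &preIdx,
                                 const std::vector<float> &spiked,
                                 int numNeurons, int synPerNeuron)
{
    size_t nSyn = (size_t)numNeurons * synPerNeuron;
    float *dW = nullptr, *dS = nullptr, *dIn = nullptr;
    int *dIdx = nullptr;
    cudaMalloc(&dW, nSyn * sizeof(float));
    cudaMalloc(&dIdx, nSyn * sizeof(int));
    cudaMalloc(&dS, numNeurons * sizeof(float));
    cudaMalloc(&dIn, numNeurons * sizeof(float));
    cudaMemcpy(dW, weights.data(), nSyn * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dIdx, preIdx.data(), nSyn * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dS, spiked.data(), numNeurons * sizeof(float), cudaMemcpyHostToDevice);

    synapticInput<<<numNeurons, SYN_THREADS>>>(dW, dIdx, dS, dIn, synPerNeuron);

    std::vector<float> out(numNeurons);
    cudaMemcpy(out.data(), dIn, numNeurons * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dW); cudaFree(dIdx); cudaFree(dS); cudaFree(dIn);
    return out;
}
```

With only a few hundred synapses per neuron, most of the 256 threads get one synapse or none, so a smaller block size may well do better; that is something to benchmark rather than assume.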
Kernel-wide synchronization can only be achieved by letting the kernel end and then launching another. It’s not very expensive, kernel launches take around 15 microseconds (function call overhead) and this is an established method of doing iterative simulations. It’s safe to call multiple kernels (in a loop for example), they will execute sequentially and the next one will not start until the previous one has finished.
If you also want to modify data from the outside in an interactive manner, you’re gonna have to do frequent memcopies. You might want to look into a mechanism called streams, they allow you to overlap kernel execution and host<->device memory transactions.
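A rough sketch of the loop described above, under stated assumptions: `stepNetwork` and `simulate` are hypothetical names, and the per-step update (accumulating the external input) is a placeholder for a real neuron model. It uses plain blocking `cudaMemcpy` on the default stream rather than streams, so copies and kernels do not overlap here.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Placeholder step: each neuron accumulates the external input of this step.
__global__ void stepNetwork(float *state, const float *extInput, int numNeurons)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numNeurons)
        state[i] += extInput[i];
}

// Runs one kernel launch per time step. inputs[t] holds the external
// (receptor) signal for step t and must contain numNeurons values.
std::vector<float> simulate(const std::vector<std::vector<float>> &inputs,
                            int numNeurons)
{
    size_t bytes = numNeurons * sizeof(float);
    float *dState = nullptr, *dInput = nullptr;
    cudaMalloc(&dState, bytes);
    cudaMalloc(&dInput, bytes);
    cudaMemset(dState, 0, bytes);   // all-zero bytes are 0.0f

    int threads = 256;
    int blocks = (numNeurons + threads - 1) / threads;
    for (size_t t = 0; t < inputs.size(); ++t) {
        // Inject this step's external signal from the host.
        cudaMemcpy(dInput, inputs[t].data(), bytes, cudaMemcpyHostToDevice);
        // Work issued to the default stream runs in order, so step t + 1
        // only starts after step t has finished.
        stepNetwork<<<blocks, threads>>>(dState, dInput, numNeurons);
    }

    std::vector<float> state(numNeurons);
    // This blocking copy also waits for the last kernel to finish.
    cudaMemcpy(state.data(), dState, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dState);
    cudaFree(dInput);
    return state;
}
```

The kernel boundary acts as the network-wide synchronization point: every neuron has finished step t before any neuron starts step t + 1.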
Thanks a lot. The situation gets much clearer to me.
That looks good… The more neurons, the better (so you have more blocks to feed the multiprocessors).
I am a beginner in NNs. I have run into a programming issue and hope I can find some help here.
I have written MATLAB code for a simple NN with 4 inputs, 1 hidden layer and two outputs, and it works well. To train the network, trainlm (the Levenberg-Marquardt algorithm) is used. I want to change the equations of this algorithm to modify its operation, which is why I have to use a “variable learning rate” and add lambda to the equations.
I have found that I should make the changes in the original trainlm code. Unfortunately, I cannot find the weight and bias equations, or the learning rate, in the original trainlm code.
I was wondering if you guys could give me some advice to overcome this issue.
Thanks in advance.