I am implementing a neural network and using a genetic algorithm for training. To calculate the error rate, I load the weights from a chromosome into the neural network (NN), run all the training samples, and obtain an error rate. The problem is that if I pass a single NN instance to my kernel, all threads share that instance: one thread may load its weights and start evaluating the training data, and another thread can overwrite those weights before the first thread finishes and gets its result. Even if I create one NN instance per block, I have the same problem within each block. Ideally, each thread would have its own NN to run the training samples against, but due to memory size constraints that is impossible.
So is there any solution for this kind of case? I need something like an atomic operation (load the weights, run all the training data, and get the result), or some other way around this problem.
I am also working with NNs in CUDA. I'm not exactly sure if this is what you are asking, but the way I handle this is to set up the kernel so that there is one thread per global neuron in the 2D array. To avoid any kind of computation collision, I also add a buffer variable to the neuron structure, do all computations from the variable value into the buffer, and only later copy the buffer values back to the variable values.
My implementation is a neural net simulation, so I have synaptic connections between neurons. At each neuron I also define a variable buffer and update it when evaluating the synaptic weights. Later I use another kernel to sum all of these up from the perspective of the post-synaptic receiving neuron. Each neuron has to keep a list of the post-synaptic neurons it feeds and a list of the pre-synaptic neurons that feed it. Done properly, there are no read/write conflicts. I use linked lists for the synapse lists, which is bad for memory-coalesced access, but I haven't come up with a workable array layout that doesn't have similar coalescing problems.
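The buffered-update scheme above can be sketched roughly like this. This is just an illustrative outline, not my actual code — the `Neuron` struct, `tanhf` activation, and kernel names are all made up for the example. The key point is that the first kernel only *reads* `value` and only *writes* `buffer`, and the copy happens in a separate kernel launch, so no thread ever reads a value another thread is mid-way through writing:

```cuda
// Hypothetical sketch: one thread per neuron, double-buffered update.
struct Neuron {
    float value;   // activation that other threads read this step
    float buffer;  // activation computed this step
};

__global__ void computeStep(Neuron* net, const float* input, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        net[i].buffer = tanhf(net[i].value + input[i]); // reads .value only
}

__global__ void commitStep(Neuron* net, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        net[i].value = net[i].buffer; // safe: nothing reads .buffer here
}
```

Because a kernel launch acts as a grid-wide barrier between `computeStep` and `commitStep`, this works across any number of blocks, which per-block `__syncthreads()` alone cannot guarantee.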
If you do have variables, such as the learning error, that have to be calculated at the net level, be sure to use global or shared device memory and use atomic functions to sum up the error. Unfortunately, the atomic functions only work on integer types. I'm still trying to figure out how to do something like an atomic function for floats, but have had no luck so far. I'm actually at a tough point where I am trying to implement a net-layer inhibition function based on the excitation activation levels of all neurons in the layer, but I'm having problems avoiding read/write collisions. Any suggestions you might have run across in this regard would be welcome. Note, I've tried every combination of thread-block layout and __syncthreads() to prevent the read/write problems, but they just will not work for the general case of many blocks of many threads. The atomic functions do seem to work. I'm guessing that CUDA does some kind of memory lock on the variable while the atomic function is being evaluated.
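For what it's worth, the float-atomic workaround that circulates on these forums emulates an atomic add by spinning on the integer `atomicCAS` over the float's bit pattern. A sketch (the function name is mine; the technique is the standard compare-and-swap loop):

```cuda
// Emulated atomicAdd for float, for hardware whose atomics are integer-only.
// Reinterprets the 32-bit float as an int so atomicCAS can operate on it.
__device__ float atomicAddFloat(float* address, float val)
{
    int* address_as_int = (int*)address;  // same 32 bits, viewed as int
    int old = *address_as_int, assumed;
    do {
        assumed = old;
        // Try to swap in (old + val); if another thread changed the value
        // in the meantime, atomicCAS returns the new contents and we retry.
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(__int_as_float(assumed) + val));
    } while (assumed != old);
    return __int_as_float(old);  // value before this thread's add
}
```

Under heavy contention the retry loop can be slow, since every colliding thread repeats the CAS until it wins.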
Note, I found the atomicFloatAdd() function in this forum and implemented it in my code. It works as advertised, but it drops my simulation speed by 50%. I might resort to scaling the float to an integer and then scaling the sum back to float.
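The scaling idea would look something like the following. Everything here is illustrative (the scale factor, names, and kernel are assumptions, not tested code), and you lose precision beyond the chosen scale and risk overflow if the accumulated sum exceeds the integer range:

```cuda
// Hypothetical sketch: accumulate a float error with the fast native
// integer atomicAdd by working in fixed point.
#define ERR_SCALE 1000000.0f  // assumed precision: 6 decimal places

__global__ void accumulateError(const float* perThreadErr,
                                unsigned int* errorFixed, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Convert this thread's error to fixed point and add atomically.
        atomicAdd(errorFixed, (unsigned int)(perThreadErr[i] * ERR_SCALE));
    }
}

// Host side, after copying errorFixed back:
//   float error = (float)h_errorFixed / ERR_SCALE;
```

This trades the CAS retry loop for a single hardware atomic per thread, at the cost of fixed-point precision.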
I see the suggestion for __syncthreads()…see the other postings on this "sync" topic (do a search)…the problem with __syncthreads() is that it is somewhat misleading if you don't read the manual all the way through on that topic…it only syncs threads within the same thread block. (For comparison, see __threadfence() and __threadfence_block() in the CUDA programming guide.) Really, __syncthreads() seems like it should be called syncthreads_block(), because it only syncs some of your threads, not all of them. I have not yet figured out whether you can use __threadfence() for this type of thing…hmm…maybe somebody else knows…__syncthreads() can lead to bad answers if you assume it syncs everything, as you scale up your code to run in multiple thread blocks. I found this out the hard way :) There is a good paper online regarding this syncing, google:
Inter-Block GPU Communication
via Fast Barrier Synchronization
(from Virginia Tech), brand new as of Sept09. I am experimenting w/ those methods. Should know soon how well they work!
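To make the block-only scope of __syncthreads() concrete, here is a sketch of the usual pattern for a grid-wide sum: __syncthreads() is only used *inside* each block for a shared-memory reduction, and the cross-block combination is done with one atomic per block rather than any attempt at a grid-wide barrier (the kernel name and 256-thread assumption are mine):

```cuda
// Grid-wide sum of unsigned ints, assuming blockDim.x == 256.
__global__ void sumKernel(const unsigned int* data, unsigned int* total, int n)
{
    __shared__ unsigned int partial[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? data[i] : 0;
    __syncthreads();  // valid: barriers threads of THIS block only

    // Tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            partial[tid] += partial[tid + s];
        __syncthreads();
    }

    // Cross-block step: no barrier spans blocks, so use an atomic instead.
    if (tid == 0)
        atomicAdd(total, partial[0]);
}
```

If the quantity being summed is a float, the atomic at the end is exactly where the atomicFloatAdd() or fixed-point tricks discussed above come in.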