I have an artificial life/neural network simulation that needs more processing power. I’ve been looking at my options (multiple CPUs, Cell, FPGA, and now CUDA). CUDA appears to be the best option, but I’m looking for some validation before I jump into it.
Background:
48% of the time is spent in collision-detection-type work (millions of particles/objects); I believe this should map to CUDA with no problem.
Another 48% of the time is spent evaluating the neural networks for the “creatures”. This is the part that does not map well to Cell: the memory accesses are not consistently adjacent, so it can’t take advantage of the vector SIMD operations and each SPE would end up doing one scalar calculation at a time. It appears that CUDA should work here because, if I understand properly, while the operation must be the same across threads, the memory being acted upon does not have to be adjacent the way it does on Cell (correct?).
The neural networks are not fully connected, but each neuron can connect to any other neuron (forward or backward), and the networks are designed to scale to large sizes while conserving memory. Here is the current data structure (simplified):
Array of Neurons
…Previous cycle value
…Current cycle value
…Index to first inbound synapse
…Number of synapses
Array of Synapses
…Index of “from” neuron
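For concreteness, here is a rough C-style sketch of how I picture those two arrays on the device; the field and variable names are just placeholders, and the synapse weight is left out as in the simplified layout above:

```
// Rough sketch of the two arrays as flat structs (names are placeholders,
// weight omitted as in the simplified layout above).
struct Neuron {
    float prevValue;     // value from the previous cycle
    float currValue;     // value being accumulated this cycle
    int   firstSynapse;  // index of the first inbound synapse
    int   synapseCount;  // number of inbound synapses
};

struct Synapse {
    int fromNeuron;      // index of the "from" neuron
};

Neuron  *d_neurons;      // resident in device (graphics card) memory
Synapse *d_synapses;
```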
After reading through the docs and forums, I believe I could implement this as outlined below and achieve significant parallelism without too much wasted work, but I’m looking for validation that I’m on the right track. Each multiprocessor would evaluate one neural network at a time (a rough kernel sketch follows the outline):
Loop through sections of the set of neurons (assuming they will not all fit in shared memory at once)
…Load X neurons into shared memory
…Loop through sections of synapses (this means the entire set of synapses will be processed once for each group of neurons, so there is some waste here)
……Load Y synapses into shared memory
……In parallel, each thread processes its assigned set of neurons (thread 1: neurons 1 through 4, thread 2: neurons 5 through 8, etc.)
……Each thread loops through the synapses of its assigned neurons and adds the “from” neuron’s input value if that neuron is in shared memory; otherwise it adds nothing (“from” neurons not in shared memory will be processed during other iterations of the neuron and synapse loops)
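To make sure the outline is clear, here is a very rough kernel sketch of that scheme, using the struct layout from earlier; the tile sizes, the 4-neurons-per-thread assignment, and the one-network-per-block handling are all simplified placeholders:

```
#define NEURON_TILE  256   // X: neurons held in shared memory at once (made up)
#define SYNAPSE_TILE 256   // Y: synapses held in shared memory at once (made up)

// Sketch: one block evaluates one network; for simplicity the pointers are
// assumed to already point at that network's arrays, and "accum" is a
// per-neuron running total zeroed before the launch.
__global__ void evalNetwork(const Neuron *neurons, const Synapse *synapses,
                            float *accum, int numNeurons, int numSynapses)
{
    __shared__ float   prevVal[NEURON_TILE];   // previous-cycle values of the current neuron tile
    __shared__ Synapse synTile[SYNAPSE_TILE];  // current tile of synapses

    // Outer loop: sections of the neuron array (the potential "from" neurons).
    for (int nBase = 0; nBase < numNeurons; nBase += NEURON_TILE) {
        for (int i = threadIdx.x; i < NEURON_TILE && nBase + i < numNeurons; i += blockDim.x)
            prevVal[i] = neurons[nBase + i].prevValue;
        __syncthreads();

        // Inner loop: sections of the synapse array (whole set rescanned per neuron tile).
        for (int sBase = 0; sBase < numSynapses; sBase += SYNAPSE_TILE) {
            for (int i = threadIdx.x; i < SYNAPSE_TILE && sBase + i < numSynapses; i += blockDim.x)
                synTile[i] = synapses[sBase + i];
            __syncthreads();

            // Each thread handles 4 consecutive neurons (assumes numNeurons <= 4 * blockDim.x).
            for (int n = threadIdx.x * 4; n < threadIdx.x * 4 + 4 && n < numNeurons; ++n) {
                int   first = neurons[n].firstSynapse;
                int   count = neurons[n].synapseCount;
                float sum   = 0.0f;
                for (int s = first; s < first + count; ++s) {
                    if (s < sBase || s >= sBase + SYNAPSE_TILE)
                        continue;                        // synapse not in this tile
                    int from = synTile[s - sBase].fromNeuron;
                    if (from >= nBase && from < nBase + NEURON_TILE)
                        sum += prevVal[from - nBase];    // "from" neuron is in shared memory
                    // otherwise add nothing; it gets picked up on another iteration
                }
                accum[n] += sum;
            }
            __syncthreads();   // everyone done with this synapse tile before reloading
        }
    }
}
```

The obvious waste is that every synapse tile gets loaded once per neuron tile, which is the redundancy I mentioned in the outline.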
Questions:
- Am I correct that, unlike the Cell, parallel execution can operate on memory that is not adjacent?
- Does the above logic appear to be a reasonable approach?
- My host program will only need to send and receive a small amount of data for each cycle of neural network calculations (<64k input, <64k output for all “creatures”). Can I leave all of my data in the graphics card’s memory across multiple iterations of processing?
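Assuming the answer is yes, the host side of what I have in mind looks roughly like this; the buffer names, sizes, and launch configuration are placeholders, and the kernel is the sketch from above:

```
// One-time setup: all of the simulation data stays on the card for the run.
Neuron  *d_neurons;   Synapse *d_synapses;
float   *d_accum;                             // per-neuron accumulator used by the kernel sketch
float   *d_in, *d_out;                        // per-cycle I/O buffers, <64k each
cudaMalloc((void**)&d_neurons,  numNeurons  * sizeof(Neuron));
cudaMalloc((void**)&d_synapses, numSynapses * sizeof(Synapse));
cudaMalloc((void**)&d_accum,    numNeurons  * sizeof(float));
cudaMalloc((void**)&d_in,  inBytes);
cudaMalloc((void**)&d_out, outBytes);
cudaMemcpy(d_neurons,  h_neurons,  numNeurons  * sizeof(Neuron),  cudaMemcpyHostToDevice);
cudaMemcpy(d_synapses, h_synapses, numSynapses * sizeof(Synapse), cudaMemcpyHostToDevice);

// Per cycle: only the small input/output blocks cross the bus.
for (int cycle = 0; cycle < numCycles; ++cycle) {
    cudaMemcpy(d_in, h_in, inBytes, cudaMemcpyHostToDevice);       // <64k up
    cudaMemset(d_accum, 0, numNeurons * sizeof(float));
    evalNetwork<<<1, 256>>>(d_neurons, d_synapses, d_accum, numNeurons, numSynapses);
    cudaMemcpy(h_out, d_out, outBytes, cudaMemcpyDeviceToHost);    // <64k down, waits for the kernel
}
```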
- If I can leave the data there, how do I prevent the OS from wiping out the graphics card memory? Do I need to get 2 cards, one for the OS display and one for my processing?
- I have 2 distinct routines that need extra processing, collision detection and the neural networks. If I use 1 card, do I code them both into 1 routine and just run one and then the other sequentially?
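If the answer is simply to run them back to back, I picture the per-step loop looking roughly like this (collideKernel and its arguments are placeholder names for my collision routine):

```
// Two separate kernels launched one after the other each simulation step.
// Launches issued to the same (default) stream execute in order on the card.
for (int step = 0; step < numSteps; ++step) {
    collideKernel<<<collisionBlocks, collisionThreads>>>(d_particles, numParticles);
    evalNetwork  <<<1, 256>>>(d_neurons, d_synapses, d_accum, numNeurons, numSynapses);
}
```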
- To optimize usage of shared memory, I can do various things to use types smaller than ints (short, byte, etc.). Will I pay a performance penalty for using these data types due to conversion by the hardware?
- Are normal bitwise operations supported? I don’t remember whether I saw these in the documentation or not.
Thanks in advance for any help.