Some questions about CUDA and my problem

I have an artificial life/neural network simulation that needs more processing power. I’ve been looking at my options (multiple CPUs, Cell, FPGA, and now CUDA). CUDA appears to be the best option, but I’m looking for some validation before I jump into it.

48% of the time is spent in collision-detection-type work (millions of particles/objects); I believe this should map to CUDA with no problem.

Another 48% of the time is spent evaluating the neural networks for the “creatures”. This is the part that does not map very well to Cell: because memory access is not consistently adjacent, it can’t take advantage of vector SIMD operations, and each SPE would end up processing one scalar calculation at a time. It appears that CUDA should work, though, because, if I understand properly, while the operation must be the same across threads, the memory being acted upon does not have to be adjacent as it does on Cell (correct?).

The neural networks are not fully connected, but each neuron can connect to any other neuron (forward or backward), and the networks are designed to scale to large sizes while conserving memory, so here is the current data structure (simplified):

Array of Neurons
…Previous cycle value
…Current cycle value
…Index to first inbound synapse
…Number of synapses

Array of Synapses
…Index of “from” neuron

After reading through the docs and forums, I believe I could implement this as follows and achieve significant parallelism without too much wasted processing, but I’m looking for validation that I’m on the right track. Each multiprocessor would evaluate 1 neural network at a time, as follows:

Loop through sections of the set of neurons (assuming they will not all fit in shared memory at one time)
…Load X neurons into shared memory
…Loop through sections of the synapses (this means the entire set of synapses will be processed once for each group of neurons; there is some waste here)
……Load Y synapses into shared memory
……In parallel, multiple threads process an assigned set of neurons (thread 1: neurons 1 through 4, thread 2: neurons 5 through 8, etc.)
……Each thread loops through the synapses for its assigned neurons and adds the “from” neuron’s input value if the “from” neuron is in shared memory; otherwise it adds nothing (“from” neurons not in shared memory will be picked up during other iterations of the neuron and synapse loops)


  1. Am I correct that, unlike the Cell, parallel execution can operate on memory that is not adjacent?

  2. Does the above logic appear to be a reasonable approach?

  3. My host program will only need to send and receive a small amount of data for each cycle of neural network calculations (<64k input, <64k output for all “creatures”). Can I leave all of my data in the graphics card memory across multiple iterations of processing?

  4. If I can leave the data there, how do I prevent the OS from wiping out the graphics card memory? Do I need to get 2 cards, one for the OS display and one for my processing?

  5. I have 2 distinct routines that need extra processing: collision detection and the neural network. If I use 1 card, do I code them both into 1 program and just run one and then the other sequentially?

  6. To optimize usage of shared memory, I can do various things to use types smaller than ints (short, byte, etc.). Will I pay a performance penalty for using these data types due to conversion by the hardware?

  7. Are normal bitwise operations supported? I don’t remember if I saw these in the documentation or not.

Thanks in advance for any help.


I’ll omit the questions I don’t have an answer for - though I hope someone responds, they’re good.

Basically yes. Adjacent addresses may be faster, but there are alternative techniques, such as shared memory (= user-managed caching) or textures (Z-curve locality caching), depending on the access pattern.

Yes, no problem. Your app creates a “CUDA context”; basically, that context owns your memory. In my app, the context isn’t destroyed until the entire program calls exit(), at which point it’s done automatically. As long as you’re in the same “stack space”, so to speak, your data is OK.

You can use one card. If you have the budget and the motherboard, it may be better to have two cards. You’ll have a bit more responsive GUI stuff, a bit faster processing, and you’ll avoid some onscreen garbage that happens when you have device memory SNAFUs. I only have one, and it’s fine.

I’d start with that approach. You’re not bound to it; you can desynchronise the two functions later, but this is easiest and there are fewer performance hazards IMO.

Shifts and logical and/or do work - I believe they’re all in there.

Good luck!

Great, thanks for the answers.

I’m pretty excited about CUDA. I was getting pretty disappointed that I would not be able to get the processing power required to see the results I hope to see with this project, but with this technology, I may be able to exceed my original plans (with enough systems and cards).

CUDA was a very opportune discovery for me. My problem is two-fold, the first part being very easy to implement on CUDA, the second part being harder, and it wasn’t clear if the required speed could be achieved. CUDA has been flexible enough, and everything seems to have worked out well for me, so I can recommend spending at least a week or two to check things out.