General Purpose GPU Programming Is there a better way?

I have a complex program that I have been converting to run on my GPU’s (1 Quadro 3700 and 3 Tesla C1060’s). This program is a neuron simulation network and as such has some very complex data structures. The primary data structure is 1) arrays of neuron layers, 2) made up of 2D arrays or neurons, 3) made up of neuron structures with physical parameters 4) which include linked lists of synapses connecting to neurons both in the layer and in other layers.

Some of the required computations are very well suited to parallel processing even though the objects are complex and diffiucult to achieve in coalesced memory. Other computions are better done in a serial manner, particularly the initializations. In order to really see how GeneralPurpose GPU computing could really be, I have moved almost all processing down to the GPU level. The top level is a DirectX applications that talks my web cams and displays the neuron network visualization (whichever neuron parameter I choose to display) in 3D. I shuttle the web cam image from host to device and eventually shuttle the device populated texture back up to the host for display.

I have all of this basically working, although not optimized. But it seems that I have to jump through all sorts of crazy hoops in order to implement these general purpose GPU computations. I’m just wondering if there is a better way and if I’ve been overlooking the better way to do complex things on the GPU.

Could someone who has worked on complex GPU data structures and code comment on what I see as “hoops” as listed here?

  1. I declare much of the neuron net data as a file scope device array of structures of arrays of data.
  2. I dynamically allocate synapse connections in the host code and place them in linked lists in device code. All data at this point is in device global memory.
  3. All device structure data has to be allocated in host code but initialized in device code (right?).
  4. To do some of the initializations and synapse wiring, I really have to do this serially but in device code since it is device global data. So, I’ve created a number of silly appearing foobar<<<1,1>>>() kernel functions so I can execute on the device serially.
  5. Most of the runtime simulation can be done in parallel with foobar<<<n.m>>>() kernel functions.

My primary question whether initializations and assignments of global device memory structure elements always have to be done in device kernel code. I’ve played around some with cudaGetSymbolAddress() but this only seems to work for simple objects or top level objects, i.e., I haven’t had much luck getting the device address in host code of a structure object a level or two deep in the data structure.

My secondary question is whether I’m going about this in a too difficult manner.

My last question is that since debugging (in Visual Studio) in non-emulation mode is almost impossible, is there a better way to debug in hardware mode? Even the ability to write trace logs from a kernel function would be useful. Has anyone developed a re-entrant capable device code log to file function?

Well, this is a lot to ask in one topic, but this represents the point I’m at in a complex program, so I’m hoping to learn from others who have attempted very complex GP-GPU development.

Thanks in advance,
Ken Chaffin


About linked lists, you should remake them into some other sort of array. Quoting NVIDIA person “It sucks. Don’t do it. You need to rethink your algorithm in a serious way.”. This is discussed on…=71&t=77512

About initialization (if i’ve understood your question right), you may initalize your variables etc,. host side and then move this data to device side as you please. Or you can do a stupid foobar<<<1,1>>> as you please.

About non-emulation debugging, i wish that was possible too…

My best advice to you is to read the cuda programming guide.

Google for Nexus - this is going to be nVidia’s integrated debugger for windows to be released soon.


Nexus looks sweet :D

With rare exception, converting array of structures (AoS) to structure of arrays (SoA) is essential for GPU performance. Actually, this is a general performance “hoop” which is also applicable to SIMD units on CPUs.

Since I never miss an opportunity to mention Thrust, I’ll point out that Thrust’s zip_iterator allows you to “zip” an SoA into a “virtual” AoS. Specifically, when you access the i-th element of a zip_iterator (iter[i]) it fetches the i-th elements of each of the arrays and wraps them with a tuple structure.

You might want to consider a different data structure here. For example, I do a lot work with sparse matrices and one way to represent a sparse matrix is an array of linked lists - one list for each row of the matrix. Although convenient and conceptually simple, this format is quite bad for performance. Instead, we generally prefer other matrix representations (i.e. data structures) that efficiently encode the sparse matrix, but in a way more amenable to high performance computation.

If you’re interested we wrote about these techniques at length in the following papers:

(more info)

Generally, one wants to allocate on the host and initialize/compute on the device as much as possible. However, this bias only exists because the internal device memory bandwidth is over 100 GByte/s while the host to device bandwidth is closer to 5 GByte/s. If your initialization procedure is inherently serial, or you simply don’t want to parallelize it yet, then it’s completely acceptable to initialize on the host and then copy the initialized data over to the device. Running a single thread on the GPU is almost certainly slower than initializing on the host and copying to the device (at a meager ~5 GByte/s)

Note that data structures with pointers (like your linked lists) can’t be easily copied to the device. I suppose this is yet another reason not to use them :)

(Another shameless plug for Thrust)

Thrust’s algorithms may save you some time and effort.

IMO your next step should be to flatten your data structures out into arrays of integer indices instead of pointers. Indices have the advantage of being portable between host and device (data[index[i]] makes sense in both places, but a pointer does not) and can easily manipulated with other functions, like the algorithms in Thrust.

Thanks for the suggestions. It seems that the way I’m doing things is the correct way, albeit not optimized. I started out this effort by dual creating my 70MB data structure, once on the host and once on the device and then initializing on the host and copying to the device. I was just convinced that there was some better way to do this in a more “general purpose” computing manner. So, I changed all of this to just create the data structure on the device and then initialize it on the device.

I am aware of the problems with pointer lists, particularly regarding coalescing in memory. I will begin taking a look at optimizing my data structures, but I still want to do this by only creating the data structures once, on the device (from within host code). This is an experiment as much as anything as to whether I can implement anything I might think up, strictly in device space, both data and code.

Ken Chaffin

One idea is to use “indices” instead of pointers in GPU data structures

Hello Ken,

I am trying to reach out your same target. Well, I own just a GTX 275 but the concept is the same.

I completed a very easy program which implements just 2^16 neurons. Each one is connected to 64 presynaptic neuron given a certain distance.

No conduction delay has been implemented.

I am porting another program on the GPU which implements conduction delay and neuronal growth and Izhikevich’s model neurons on more layers.

On the CPU I am now using list structures via STL but would like not to use them for efficency but just plain arrays.

Maybe the CPU could arrange the huge data structure for GPU to achive best results.

At this moment the simple datastructure I use in my first program is very simple:

given the array of N neurons there are M presynaptic neurons so it’s just a plain

int presyn[N][M] whos values can vary from 0 to N-1 of course.

float W[N][M] contains the weights.

float dW[N][M] contains the derivative of the weights

float LTP[N] and float LTD[N] for the potentiation and depression of the synapse.

int spike[N] represent the spiking state of every neuron (0 or 1)

float i1[N] are the inputs

float v[N],vn[N],u[N],un[N] are the variables that describe the dynamic of the neuron.

I guess you started from a similar starting point, do you confirm me this?

I would like to talk about optimization and algorithms to implement in the best way all this.

Have a good day,


PS: to tell you the truth I don’t like pointers and linked list on the device ;-)

Hello Paolo,

Yes, it sounds as if we are working on some very similar problems. For a few years I’ve worked on implementing a hippocampus 3 layer neuron simulation array based on the work of Randall O’Reilly and Yuko Manukata as described in their textbook “Computational Explorations in Cognitive Neuroscience”. These neuron simulations are somewhat complex with membrane potential and activation potential, excitation and inhibition, leakage etc. As such my neuron data structure is about 72 bytes in size and doesn’t fit well with the CUDA array capabilities. It appears that our data elements are probably very similar but mine is all in a single structure and yours are broken up.

I have video coming in from a web cam (converted on CUDA device to grayscale) and I use that to stimulate the first neuron layer and I play around with forward and backward signal propogation in the net, sometimes with reference images on the 3rd layer and Hebbian learning. I like to use the video images so that I can display visualizations of the active net via textures in 3D space. I had been using 1 million neuron arrays for each level, but for CUDA I have reduced that to 64K neurons temporarily. My synapse structure is much simpler than the neuron and is really just a logical connection between neurons, intra-layer and inter-layer. I have some distance and psuedo-randomization parameters for wiring up the initial synapses. It’s easy to get way too many synapses if I try to duplicate a real hippocampus.

I originally wrote this program in C++ and I’m porting it to CUDA on the GPU for fun. I had to find something useful or interesting to do with my 3 Tesla board system. I’m proud to have such a personal system. I just put it together a couple of months ago. My system has 16GB of RAM and dual quad core 3.2GHz CPU’s. This is all hobby work for me (an expensive hobby!).

Since I originally wrote this application in C++, I made heavy use of pointers. I have had to convert almost all of these to to array elements. I will be converting my synapse linked list to an array soon. I’ll probably still wire the synapses up using a linked list and after completion, dynamically allocate an array of data or indices and use that at runtime. I still have a lot of work to do with optimization, particularly coelescing in memory. For now I’m less concerned about the efficiency than the technology of doing a very complex parallel processing application.

I maybe should look at breaking up my neuron structure array into multiple arrays of smaller pieces as you do in your data representation. That would be a lot more parallel friendly. In fact, I’m glad you showed your data structure as that gives me some good ideas.

I had to implement my own CUDA device memory allocator because I ran out of memory way too quickly when allocating many objects. This was particularly true for the synapse lists. If CUDA is using 16K or 64K page size on memory allocation, even 4GB is exhausted quickly.

I love pointers on the CPU, but I don’t like them on the device. They are just too difficult to use and inefficient.

I’m having a very difficult time debugging my simulation in non-emulation mode. I currently have a bug that I have been trying to track down for days. I look forward to NVIDIA’s Visual Studio compatible debugger.

Well, this is perhaps more than you wanted to know. I would be interested in hearing more about what you are doing. Is your work hobby or professional?

Here are my data neuron and synapse data structures:

struct SynapseType // this is actually a list of all the synapses


struct NeuronType *Neuron;    // the neuron that the synapse branch terminates at

FLOAT  Weight;                // 0-1.

FLOAT  Length;                // distance between the neurons

struct SynapseType *next;        // for a linear list of linkages


struct NeuronType


INT    Layer;                    // which Layer is this neuron a member of

FLOAT  MembranePotential;

FLOAT  PlusPhaseActivationPotential;

FLOAT  MinusPhaseActivationPotential;

FLOAT  BiasWeight;               // Neuron potentiation based on difference between plus and minus phase activationpotentials

FLOAT  gE;                       // Excitory conductance

FLOAT  gI;                       // Inhibitory conductance

FLOAT  gEBuff;                   // Buffered Excitory conductance

FLOAT  gIBuff;                   // Buffered Inhibitory conductance

struct SynapseType* IntraLayerExcitorySynapses;    // within the Layer

struct SynapseType* InterLayerExcitorySynapses;    // between Layers

INT    ActiveExcitorySynapseFeedingCount;

INT    ActiveInhibitorySynapseFeedingCount;

FLOAT  Error;


Ken Chaffin

Hello Ken,

I am glad you answered my email,

maybe differences arise in synapse model and neuron model. Nice you indicated out to me the work of Randall O’Reilly and Yuko Manukata. I will soon read about their work.

In the implementation I did prefer splitting the structure in independent arrays so that I can pass pointers to CUDA procedures directly. Well I guess this shoud be possible using struct {} too. I started out this work with fixed number of presynaptic neurons but It’s not nice. Nature does not run in such this way.

In my program the subroutine that determines the presyn neurons just count them, allocate dynamically the needed memory amount and put the index of the discovered presyn neurons in the allocated array. This is good for me but needs two steps to perform this operation of “find and link”.

As I can read out of oyur email you send inputs on neurons using frames of a film so that you can keep trace visually about the evolution of the network.

I am more raw in this :-) so I just send random spikes and see if neurons aggregate forms and monitor synapses strength.

I used a synapse model made by me in which values are 0…MAXW. The difference of many synapatic model I saw around is that if w (let’s say this is the synaptic weight) w has more inertia if in the proximity of the extremes 0 and MAXW. Well, I don’t know if Natures works in such this way but I like making experiments: I love math so much!

And… yes, you really have an enormous horsepower to use! I hope you will find out the best way of getting out the most satisfaction, nasty things: yes, it costs. Me too had to take chioces of 64K layers or even smaller. I do own just a 275GTX with 896MByte of ram so have to save memory and computational power.

I did some proto-programs in freebasic too. It’s so easy to use… would like to send you via email my neuron model if you’d like.

Well, it does not take in account values of real neurons but can simulate many behaviours of regular spiking, fast spiking, chattering, burst and so on.

If you think you’re not interested into it’s ok for me.

Things that I noticed: developing all these programs and using many energies of me when all these simulations are starting to work fine my neurons and synapses are so exhausted, at that point they start to work so bad… :-)

For me too this all is just hobby and love to explore ways and doing things by my own. Not just copying and implementing things of other guys.

Everywhen I do something by my own and think it does run inherently my expectations I get very happy and the self satisfaction grows up!

I guess it would be the same for you.

Unfortunately I never did debugging of a cuda program: I cannot 'cause I just have one board. Before this I owned a 8800GT, now detached by my pc.

Me too have a very big bug in my C++ (non cuda program) and trying to solve it. At a certain point you see spiking running out of dendrites! This alwasy

happen if many neurons do spike. It’s months I am working on that…

Ok, I go doing some other works now.

See you and a have a good weekend


Hello Paolo,

It is good to hear that others can be as obsessive about a hobby as I am.

To tell the truth, even though I have 3 Teslas in my system, I have been doing all of my development on my Quadro 3700 board. When I have all 3 Teslas enabled, it is like I have a 1200 watt heater in my room. I had to install a powerful A.C. fan on the side of my tower to blow high volumes of fast air over the 3 Teslas and the Quadro. So, I set up some hardware profiles and can boot with either 0, 1, 2 or 3 Teslas enabled. Other that testing that I could execute on the Tesla, I have kept the Teslas disabled until I get further along on this project. I eventually would like to do some brain system simulations where the Quadro and 3 Teslas are responsible for areas of the brain; hippocampus, cortex, frontal lobes etc. I would need to work out how to communicate between the boards.

There is quite a bit of detail in my neuron simulation that is not evident from my neuron and synapse structures. I keep quite a lot of information at the “layer” level, if the same parameter values can apply to all neurons in the layer. I also have a number of firing threshold functions, particularly a sigma function that may address the issue you mention of firing behavior at the extremes of the membrane potential. I got these out of the book I referenced. I also have noise functions that make the firing threshold noisy.

I did most of this C++ coding about 5 years ago. Now it is common for me to look at the code and think “did I write that?” or “what is this supposed to do”. I’m not totally back up to speed on my own code. At the time I was self employed and able to spend several months constantly developing this program. Now this is an evening and weekends hobby.

In addition to being able to input video “data” into the simulation, I also have a number of random seeding modes. For quite some time I worked with neural convolution kernels that resulted in cellular automata. It sounds like you enjoy watching these type of evolving patterns also. One of my favorite brain scenarios was what appeared to be an epileptic seizure or firing storm. I have also had sound input at one time but I think that is broken. Something that I did that has been very useful was to make my program able to switch which parameter is being displayed via keyboard presses. So, I usually watch the membrane potential, but I can watch any of other neuron structure parameters if I want.

The neural model I use does not mimic nature via spiking and firing rates. My synapses fire either all or nothing and then behave like an electrical circuit, perhaps a capacitor being charged through a resistor with leakage.

Well, I am avoiding work to find my CUDA bug. It was nice to hear the details of what you are doing. I welcome any comments, questions or sharing that you would like to do along these lines.

Ken Chaffin