How can I parallelize this?

Hi,

I have a problem, that can be solved easily on a cpu:
A (large) number of particles with different coordinates in two dimensions (x_i,y_i) are weighted on a spacial grid in the x-y-plane. Thus every particle is surrounded by four gridpoints. Depending on where exactly the particle is in this square of four gridpoints, every gridpoint (float) gets some amount of value added (linear weighting). There can be hundreds of particles in one of the cells, so evey gridpoints value after the whole calculation is the sum of all those particles in its cell.
That can be done by running a “for” loop with (i=0)…(i<No_particles) with just every particle in series.

My problem is: When I try to parallelize that, for example every particle gets his own thread, it can happen, that some of the threads try to write on the same gridpoint and i get garbage results. A atomic add for floats could help there i think. A reduction needs much memory, because i would need the whole grid for every thread, and the grid is to large (for example 50*300 points)…
Should I leave the task for the cpu?

Does anyone have any ideas?
Thx for all help!!!

ok, first idea i have is that one thread works on one cell.

the data is threedimensional, os every thread has a list of particles that he works on.

yes this just makes sense if the grid is big enough.

number of cells should be much greater than the number of particles per cell.

and the number of particles per cell should vary to much.

second idea is more difficult. every thread calculates one particle (as you mentioned)

you will have to use some kind of mutex to access the data every thread has to write.

the difficulty of this is afaik discussed in this forum (try search)

one idea to solve the problem i have is not to write on one memory postion.

you could probabkly use a sort of read-copy-update

you take a reference, read what is in there perform your add

and save the result to a new object referenced by a pointer

and then you use the cas with the value of the old pointer an change it to your new one

if the compare fails you have to repeat the procedure

i dont really know if this works never tried it (but would be interesting)

a code example for better understanding:

float results;//result array for block results

float *endResult;  //should point to a float variable =0

//get pointer: (READ)

float *pResult=endResult; //now we have a copy of that pointer

// calculate new value (COPY)

results[threadID]+=*pResult; 

//and now try to write back my pointer (UPDATE)

atomiCAS((int*)&endResult,(int*) pResult ,(int*)(result +threadID));

Ok, I forgot that: The number of particles can be a hundred thousand or a million or perhaps a hundred million, so there will usually be much more particles than gridpoints and also usually many particles in one cell.

1 grid point is shared by 4 cells. Is that right? (assuming 2D)

The idea would be to allocate one block for a group of grid points… If one could imagine the grid points as a rectangular mesh - you could divide them into many sub-rectangular meshes. Each such mesh will be mashed (calculated) by a block. Note: For each such Mesh, the block cannot calculate for the boundary points. It can only calculate for the inner grid points…

Within each block, a thread could be assigned to an inner-grid point.

You may need to re-design your data structures such that the calculation above can be done easily. — because the design migrates from “for each particle” to “for each grid point”.

Exactly!

I allready did that, because i needed that for another caluclation. The number of grid points is a multiple of 16 (or any other number), in x- and y-direction respectively, so i can devide the (recangular) plane in several 16*16 parts.

I’m not sure if I understood. In your suggestion, I would need to know which particle is in which block of cells, right? I don’t see any accelration in the code then, because I need to check that in series too… Perhaps I got you wrong.

Btw: Thx for your help

@ St3fan82:

I will test that, but I am afraid, that I have too many threads trying to access the same gridpoints, which causes long delay times.
Anyone here who tested that approach already?

Given a grid point (which is now owned by a thread in a block) – you would need to know what all particles can contribute to this grid point. THats why I said you need to organize your data-structure correctly.

This is easy to find, if your particles are organized in a particular way.

For example, say your particles are organized as [2D array of CELLs + 2D arrays of particles inside a CELL] - then the parallel approach outlined above will work fine.

It all depends on your data-organization of particles. If you can throw more light into it, we can discuss alternate parallel solutions.

Hi again,

Well at the moment all particle values are stored in an 1-D array of float2 values. The grid points are stored in an 1-D array of float values. This is just my first try, because i programmed the code to run on the cpu. If changes have to be made in the structure of data saving, i can do that.

The problem is, the particles will move every time-step. And for every time step the weighting on the grid-points has to be done again. If I save the “cell-information”, so I save in which cell the particles are in, I could do the weighting calculation to the grid points very fast. But then I have to calculate that information for every time step too. Or I need to implement some code that organizes the transfer from one cell to another, that adds new particles to a cell and removes particles from a cell very fast… I’m not sure how to do that.

The problem does get trickier with the introduction of time-step. Among other things,you can consider:

  1. Delta Values in grid point weights induced by movement of a particle. - so u need not calculate everything all the time. Just the delta alone (just like how video-encoding or version control software works).

  2. Look @ netting out the delta values due to particle movements… Like 2 particles (affecting the same grid point) moving in opposite directions may effectively nullify each other’s movement and so on.

  3. One could have a separate array called “Delta” array in the same form as “2D cell - 2D particle” array that would specify the “Delta” for each time-step

    One could write a GPU kernel to compute the delta function (the other GPU kernel being the main one that calculates the initial weights)

    This GPU kernel could do the netting out easily – because we are still having 1 thread per grid-point.

    a) This way you can calculate the weights once and store it permantely in GPU memory.

    B) Everytime you need a delta, you pass the delta array and call the delta kernel which keeps changing the grid point weights.

Best Regards,

Sarnath

What about that:

Do you see any chance to store the particles in all single cells seperatly, so a cell has a certain amount of float2 values, belonging to the particles in that cell?
Then, when particles move, normally most of them will stay in that cell (otherwise the timestep has to be choosen smaller, because the simulation gets unstable). Some of them will leave the cell and some new will come there from other cells. This would cause a memory reallocation for every cell at every timestep, because the amount of particles in the cell will change slightly at every time step.
I could calculate all cells independently, but what I need for that, is a “transfer”-possibilty, that adds particles to the cell and removes them. That transfer has to work somehow parallel, because every cell has at least 9 surrounding cells that can give or take particles.

In addition, partciles can leave the area at the boundary and also new particles are created in the area.

I thought about a possible way, what do you think about that:

  1. Every cell “i” has a certain amount of particles N_i, which locations are saved in an 1-D float2 array, that belongs to that cell. I also save in an extra integer array the amount of particles per cell, that means the N_i values.

  2. In a first step, I calculate for every cell the amount of values, that remain in the cell and save the (amount of) values in an integer array, call it R_i. So a value of 1231 in that integer array means, that 1231 particles remain in that cell.

During that calculation, I save the location of all remaining particles in a new array, starting with index 0 to index (R_i-1). The new array is thus taken in all it’s places. The location of all leaving particles has to be saved in an extra array.

  1. I have to save how many particles are added to each cell, so I create another integer array, that just saves the gain to that cell from other cells. That can be done with the atomicAdd function. The (int) result “j” of the atomicAdd function will belong to the particle that leaves the cell and indicates the new index of that particle in it’s new cell -> j + (R_i-1).

If I’m not wrong, all those calculations can be done seperatly for every grid-cell, only the atomicAdd functions for the losses will block each other a bit.
I hope one can understand the idea…

Philipp.

Yes true. But consider a linked list of arrays to manage things dynamically. To reduce contentions – you can swap particles between cells when particle movement happens in both ways (or) Otherwise, you can exchange the moved-out particle data with the last element in the cell’s particle array and reduce the total valid elements in the array by 1. So, it should not be a difficult process.

The overall order should be like this:

  1. Look @ exchanging particle data for all moved-out particles. Do the exchange.

  2. If u still have moved-out particles, swap them with the last-element and reduce the valid count by 1 for each such moved-out particle

  3. Accomodate new data from outside. Use the space left over by moved out particles. If u dont have space, then allocate new array and manage it.

The linked-list of arrays and their managemnt must be implemented as a class so that u can operate on it transparently.

If you think this is a big bottleneck, then you can consider super-cells (that comprise a cluster of cells, may be 16x16 of them as u suggested before) and thus reduce the inter-cell movement. Then, your computation could have redundant comparisons for particles that really dont contribute to a grid point.

Thinking about the computation, it would be prudent to assign a block of threads (with possibly 64 threads) to compute 1 grid point. This way, you can make sure that memory-accesses are coalesced – which will prove VERY crucial for your kernel.

If your computation proves to be the bottleneck when compared to management of data-structures – you could even consider use of sorted array or the likes to ease out your computation – if that makes sense. For example, with a sorted arrangement, it will be able to arrve at lower-bound and upper-bound of indices of an array where you can find all particles related to a grid point (with some redundancy i.e. sometimes more than all particles)

9 sorrounding cells? I think you mean 8.

Also note that for calculating 1 grid point, you need to work with 4 cells. Remember to use 1 full block for a grid point as I suggested before. You kernel will be extremely memory bound and without coalescing you will straighaway lose lot of performance.

Implementing the 3 steps (overall order thing) listed in my previous post may NOT really be as trivial as it looks. It might involve a good detaild algorithm (depending on how you update particles).

I can imagine a recursive or a stack-based algorithm assuming that you update particles one by one. But it really depends on how u update your particles. So, I really dont want to open the pandora box now.

Since i’m not a computer scientist, I’m just going to try ;)
I know on the CPU-RAM realloc takes like hours. On the GPU-RAM i really don’t know. I let you know when I’m ready, after Easter…
Thank you very much for your discussion!!!
Philipp.

In my method listed above, there is no re-alloc (linked list of arrays). You just keep growing if u have more OR you just shrink…

re-alloc usually involves a “copy” – which is why it is quite costly. In the method that I described above, there is no copy. U just append over or shrink your effective size.

Anyway, It was good to have this discussion. Thanks for that.

Catch u later,

Happy Easter,

Best Regards,

Sarnath