Sharing data between blocks

I am working on a physics system, and I’m pretty much stalled at a point where my CUDA code is only 2x the speed of multi-threaded CPU code (not counting data transfer Host<->Device). This is not great. However, I think CUDA can do much better.

In my current implementation, each thread is reading in a constraint from global memory, relaxing the constraint and writing the new point data back out to global memory. (I have methods in place to stop data clashes.) It is this global memory access that is killing performance. If I could load the geometry into shared memory, run a whole lot of rounds of relaxation, and then write it back out, this would be much faster. However, my geometry is much larger than one SM can handle, and the tiles/blocks need to interact at each iteration. I have of the order of 10000 particles and 40000 constraints. (Imagine a large piece of cloth, 100 points along each side.)

I was planning out a method of tiling my geometry and sharing the bordering points between blocks at strategic points in the processing, at the end of each iteration. However, there is no way to synchronise between blocks, and if I launch successive blocks, I lose the shared memory and have to run everything back out to global memory again.

Is there a good way to work on large interlinked physics datasets that won’t fit into a single warp, or a single SM? …and still make use of as many cores as are available?

I imagine this is similar to the requirement for using more than one block to invert a really large matrix.

how do the constraints map to the particles - e.g. 1 to 1; 1 to 4; 1 to 10000?
if you have an input array/ matrix of 10000 elements, how many elements does your output array/ matrix have?

On average, there are about 4 constraints per particle. It’s not strictly a grid mapping, as some of the particles have only one or two constraints, and some particles have more than 4 constraints. Constraints mostly reference 2 particles. The output array of particle positions is the same size as the input array (10000), because each particle moves. Inputs = 10000 point positions (x,y,z) and 40000 constraints. Output = 10000 updated point positions.
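
To make that concrete, a rough sketch of the data shapes involved (the struct and names below are just an illustration, not the actual code):

```
#include <vector>
#include <cuda_runtime.h>   // for float3

// Illustrative layout only - the exact constraint fields are an assumption.
struct Constraint {
    int   p0, p1;       // indices of the two particles the constraint links
    float restLength;   // separation the relaxation tries to restore
};

const int numParticles   = 10000;   // ~100 x 100 cloth
const int numConstraints = 40000;   // ~4 constraints per particle on average

std::vector<float3>     positions(numParticles);    // input and output: x,y,z per particle
std::vector<Constraint> constraints(numConstraints);
```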

The geometry can be broken down so that the interaction between adjacent blocks is more like 100 particles, but this interaction needs to take place at each relaxation iteration. Currently, all 10000 points come out of global memory and back again at each relaxation iteration. It would be nice to keep most of the points in shared memory, and only exchange the interacting points through shared memory, but this would need synchronisation between blocks, which is not possible.

it seems as if the outputs - the updated particle positions - are determined not only by the constraints - per particle, or in general - but also by other/ adjacent particles
this is the only reason that i can think of why you would wish to exchange/ export ‘interacting points’

is it really necessary to iterate on the constraints/ relaxations - why can you not simply complete all relaxations per block, before moving on to the next block, or is this what you are considering?

by synchronization between blocks, i take it that you mean preserving the order of blocks
there is at least 1 method that i can think of, that generally accomplishes this
you can either push a sequence of block numbers into (global) memory, and have thread blocks base the block they process next on this sequence; the sequence is read via an atomic (see the sketch at the end of this post)
or, you could pass to a thread block the block numbers and count of blocks that it needs to process
in both cases mentioned, you would limit the number of thread blocks to the number of thread blocks you can seat per SM, multiplied by the number of SMs
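
a rough sketch of the first variant - persistent thread blocks pulling the next block number from a sequence in global memory via an atomic; processLogicalBlock() and the other names are merely placeholders:

```
#include <cuda_runtime.h>

// cursor into the sequence; reset to 0 from the host (cudaMemcpyToSymbol) before launch
__device__ int nextIndex;

// placeholder for the actual per-block work (relaxing that block's constraints)
__device__ void processLogicalBlock(int logicalBlock)
{
    // ...
}

// each resident thread block keeps pulling the next logical block number from
// the sequence in global memory; the cursor is advanced via an atomic, so
// blocks are picked up in the intended order
__global__ void persistentBlocks(const int* blockOrder, int numLogicalBlocks)
{
    __shared__ int myBlock;
    for (;;) {
        if (threadIdx.x == 0) {
            int i = atomicAdd(&nextIndex, 1);
            myBlock = (i < numLogicalBlocks) ? blockOrder[i] : -1;
        }
        __syncthreads();                 // everyone sees the block number
        if (myBlock < 0)
            break;                       // sequence exhausted
        processLogicalBlock(myBlock);
        __syncthreads();                 // finish before thread 0 overwrites myBlock
    }
}

// launch with only as many blocks as the GPU can keep resident, i.e.
// blocksPerSM * numberOfSMs, as noted above:
//   persistentBlocks<<<blocksPerSM * numSMs, 256>>>(d_blockOrder, numLogicalBlocks);
```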

Yes, each constraint links two points together, so the updated position of each point depends on the coordinates of the other points around it. In my current model, I am using one thread per constraint. Each constraint is ‘relaxed’ in turn, updating the positions of the points involved in that constraint. Once all the constraints are ‘relaxed’, the whole process must be iterated many times to converge on the solution. I’m currently working in global memory, and launching a new grid of blocks/threads for each iteration. Just trying to figure out a way to get off that global memory.
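
To illustrate, a simplified sketch of that kind of one-thread-per-constraint kernel (placeholder names; the atomicAdd on the position components only stands in for whatever clash avoidance is actually in place):

```
#include <cuda_runtime.h>
#include <math.h>

// One thread per constraint, everything through global memory, relaunched for
// every relaxation iteration. Illustrative only.
struct Constraint { int p0, p1; float restLength; };

__global__ void relaxConstraints(float3* pos, const Constraint* cons, int numConstraints)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= numConstraints) return;

    int a = cons[c].p0, b = cons[c].p1;
    float3 pa = pos[a], pb = pos[b];                 // global memory reads

    float dx = pb.x - pa.x, dy = pb.y - pa.y, dz = pb.z - pa.z;
    float len = sqrtf(dx * dx + dy * dy + dz * dz);
    if (len < 1e-6f) return;                         // degenerate constraint
    float corr = 0.5f * (len - cons[c].restLength) / len;   // each endpoint moves half way

    // global memory writes: pull/push both endpoints toward the rest length
    // (float atomicAdd needs compute capability 2.0+)
    atomicAdd(&pos[a].x,  corr * dx); atomicAdd(&pos[a].y,  corr * dy); atomicAdd(&pos[a].z,  corr * dz);
    atomicAdd(&pos[b].x, -corr * dx); atomicAdd(&pos[b].y, -corr * dy); atomicAdd(&pos[b].z, -corr * dz);
}

// Host side: a fresh grid launch per relaxation iteration.
// for (int it = 0; it < numIterations; ++it)
//     relaxConstraints<<<(numConstraints + 255) / 256, 256>>>(d_pos, d_cons, numConstraints);
```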

So, between each iteration, the updated point positions must be shared between blocks. The synchronisation is required to make sure that the passing of updated points occurs for all blocks at the same time.

I wonder whether the most efficient way is to max out a single block with the maximum number of threads and somehow schedule all the work into those threads. Even though this will limit my code to using one SM, at least I can avoid the global memory. This will make the threads quite complicated, and they will have lots of loops in them, but it might be possible to keep the code clean and non-divergent.
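
A sketch of what that single-block scheme could look like (placeholder names; note that 10000 float3 positions are about 120 KB, more shared memory than a typical SM offers, so in practice the positions would have to be tiled or partly left in global memory):

```
#include <cuda_runtime.h>
#include <math.h>

// Single resident block: positions staged into shared memory, all relaxation
// iterations run inside one kernel with __syncthreads() between passes, and
// global memory touched only at the start and end. Illustrative only.
struct Constraint { int p0, p1; float restLength; };

__global__ void relaxSingleBlock(float3* gPos, const Constraint* cons,
                                 int numParticles, int numConstraints, int numIterations)
{
    extern __shared__ float3 sPos[];                       // sized at launch

    for (int i = threadIdx.x; i < numParticles; i += blockDim.x)
        sPos[i] = gPos[i];                                 // stage in once
    __syncthreads();

    for (int it = 0; it < numIterations; ++it) {
        // each thread handles a strided subset of the constraints
        for (int c = threadIdx.x; c < numConstraints; c += blockDim.x) {
            int a = cons[c].p0, b = cons[c].p1;
            float dx = sPos[b].x - sPos[a].x;
            float dy = sPos[b].y - sPos[a].y;
            float dz = sPos[b].z - sPos[a].z;
            float len = sqrtf(dx * dx + dy * dy + dz * dz);
            if (len < 1e-6f) continue;
            float corr = 0.5f * (len - cons[c].restLength) / len;
            atomicAdd(&sPos[a].x,  corr * dx); atomicAdd(&sPos[a].y,  corr * dy); atomicAdd(&sPos[a].z,  corr * dz);
            atomicAdd(&sPos[b].x, -corr * dx); atomicAdd(&sPos[b].y, -corr * dy); atomicAdd(&sPos[b].z, -corr * dz);
        }
        __syncthreads();                                   // finish the pass before the next
    }

    for (int i = threadIdx.x; i < numParticles; i += blockDim.x)
        gPos[i] = sPos[i];                                 // write back once
}

// Launch with a single block and dynamic shared memory for the positions:
//   relaxSingleBlock<<<1, 1024, numParticles * sizeof(float3)>>>(d_pos, d_cons,
//       numParticles, numConstraints, numIterations);
```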

My other thought at the moment is to use the current architecture and work on minimising and streamlining the memory access. The current threads are pretty straightforward, and I’d like to keep them that way, in preference to a far more complicated thread that avoids global memory.

i think, even though you clearly have dependencies between the blocks of your 100x100 ‘cloth’/ grid, you may have sufficient block independence to process the blocks concurrently, to some extent

how many iterations before you reach equilibrium? or does this vary?

do you fold over the natural borders of the 100x100 ‘cloth’/ grid, or not?

so, your number of constraints - 40000 - is given by the number of ‘connections’/ links possible between particles?

if i understand your problem correctly, and i am by no means certain that i do:

relaxing 1 constraint can theoretically upset all particles’ distances of the 100x100 cloth/ grid

now, if this holds, the problem may perhaps also be seen as 40,000 100x100 matrix interpolations

immediately, this pushes the dependencies back, allowing significantly greater concurrency, and greater control over global memory use/ reuse