I am currently trying to run a game-of-life-like simulation where each ‘organism’ is processed as a thread. Given a big grid size, I could potentially have more than 512 (the maximum number of threads per block) of these small units running around. How can I exceed this 512 limit and still have different threads be able to influence and affect each other?
Shared memory is only visible to threads in the same block. Physically, there is a separate bank of shared memory on each multiprocessor, but even when two blocks run on the same multiprocessor, they can’t see each other’s shared memory either.
If you make each update of the grid a single kernel call and double buffer the grid, then you won’t need any communication between threads (not quite true if you want it to be fast, as you’ll see in the next paragraph). You initialize grid #1 with the starting state, and then call your __global__ function, which does 1 iteration. Each thread reads its nearest neighbors from grid #1 and writes its new state to grid #2. Then you call your update function again, but this time swap the pointers so that the threads read from grid #2 and write to grid #1.
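The double-buffered update could be sketched roughly like this. It is only an illustration under some assumptions that are not in the post: the cells follow the standard Conway rules, cells outside the grid count as dead, the grid is stored as one byte per cell, and the names life_step and run_life are made up for the example.

```cuda
#include <cuda_runtime.h>
#include <utility>
#include <vector>

// One thread per cell: count the 8 neighbors in `in`, write the new state to `out`.
// Assumption: standard Conway rules, and cells outside the grid are dead.
__global__ void life_step(const unsigned char* in, unsigned char* out,
                          int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int alive = 0;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;
            int nx = x + dx, ny = y + dy;
            if (nx >= 0 && nx < width && ny >= 0 && ny < height)
                alive += in[ny * width + nx];
        }
    }
    unsigned char self = in[y * width + x];
    out[y * width + x] = (alive == 3 || (self && alive == 2)) ? 1 : 0;
}

// Hypothetical host helper: runs `steps` iterations and returns the final grid.
std::vector<unsigned char> run_life(const std::vector<unsigned char>& start,
                                    int width, int height, int steps)
{
    size_t bytes = start.size();
    unsigned char* d_read = nullptr;
    unsigned char* d_write = nullptr;
    cudaMalloc(&d_read, bytes);
    cudaMalloc(&d_write, bytes);
    cudaMemcpy(d_read, start.data(), bytes, cudaMemcpyHostToDevice);

    // 16 x 16 = 256 threads per block; the number of blocks grows with the grid,
    // so the total thread count is not bound by the per-block limit.
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);

    for (int i = 0; i < steps; ++i) {
        life_step<<<grid, block>>>(d_read, d_write, width, height);
        std::swap(d_read, d_write);  // the grid just written is read next time
    }

    // After the final swap, d_read points at the most recent state.
    std::vector<unsigned char> result(bytes);
    cudaMemcpy(result.data(), d_read, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_read);
    cudaFree(d_write);
    return result;
}
```

Because every launch only reads the old grid and only writes the new one, threads in different blocks never need to talk to each other within a step; the kernel boundary acts as the grid-wide synchronization point.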
In order for this to be fast, you’ll have to go one step further and have each block of threads cooperate to read the relevant region of the larger grid into shared memory (the block’s own cells plus the neighboring cells just outside its edges). Then all of the threads can access their neighbor cell states from shared memory instead of inefficiently reading the same memory over and over from the master grid.
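A minimal sketch of such a shared-memory version is below. The same assumptions as for any simple version apply and are not from the post: standard Conway rules, dead cells outside the grid, one byte per cell. The tile size TILE and the names life_step_tiled and step_tiled are hypothetical, and the kernel expects to be launched with TILE x TILE threads per block.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 16  // assumed block size: TILE x TILE = 256 threads

__global__ void life_step_tiled(const unsigned char* in, unsigned char* out,
                                int width, int height)
{
    // The block's cells plus a one-cell border of neighbors on every side.
    __shared__ unsigned char tile[TILE + 2][TILE + 2];

    int x0 = blockIdx.x * TILE - 1;  // grid coordinates of tile[0][0]
    int y0 = blockIdx.y * TILE - 1;
    int tid = threadIdx.y * TILE + threadIdx.x;

    // Cooperative load: the 256 threads stride over all (TILE + 2)^2 entries.
    for (int i = tid; i < (TILE + 2) * (TILE + 2); i += TILE * TILE) {
        int ty = i / (TILE + 2), tx = i % (TILE + 2);
        int gx = x0 + tx, gy = y0 + ty;
        bool inside = gx >= 0 && gx < width && gy >= 0 && gy < height;
        tile[ty][tx] = inside ? in[gy * width + gx] : 0;  // outside = dead
    }
    __syncthreads();  // the whole tile must be loaded before anyone reads it

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= width || y >= height) return;

    int cx = threadIdx.x + 1, cy = threadIdx.y + 1;  // this cell inside the tile
    int alive = tile[cy - 1][cx - 1] + tile[cy - 1][cx] + tile[cy - 1][cx + 1]
              + tile[cy][cx - 1]                        + tile[cy][cx + 1]
              + tile[cy + 1][cx - 1] + tile[cy + 1][cx] + tile[cy + 1][cx + 1];
    unsigned char self = tile[cy][cx];
    out[y * width + x] = (alive == 3 || (self && alive == 2)) ? 1 : 0;
}

// Hypothetical host helper: performs a single tiled step and returns the new grid.
std::vector<unsigned char> step_tiled(const std::vector<unsigned char>& start,
                                      int width, int height)
{
    size_t bytes = start.size();
    unsigned char* d_in = nullptr;
    unsigned char* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, start.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    life_step_tiled<<<grid, block>>>(d_in, d_out, width, height);

    std::vector<unsigned char> result(bytes);
    cudaMemcpy(result.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Each block still reads the border cells that belong to its neighbors from the global grid, so organisms near a block edge see the ones in the adjacent block even though the two blocks never share memory.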