The Game of Life in CUDA

hi all,

I am thinking of implementing “The Game of Life” in CUDA, I was wondering whether some one has already implemented it, so that I could work to optimise it.

Thanks

?

Go on with your project. Funny one but I think no one tried it yet ;)
I am just a bit worried that if the board becomes too big, the code may become memory bound. You might think about handling escaping entities in some sort of queue rather than in binary format of the board.

A friend of mine was thinking about doing something like that a few months ago and I did a quick research. I found nothing of interest then. He ultimately went for something different.

Go for it, it should work great.

I wrote a game of life program using pixel shader 1.1 back then. Then I found that NVIDIA had written a very similar one too.
Of course, the old pixel shader 1.1 program has several serious problems. Basically it can handle only one array, and the largest allowed texture size was not very big at that time.
To be able to handle real large game of life board, you need to cut the board into smaller tiles. That means you’ll have to handle border cases (i.e. the cells at the edge of tiles) and you’ll have to know when to add new tile and when to remove empty tiles (this is actually the most difficult to do back in the pixel shader 1.1 version, as it’s very difficult to verify efficiently whether a texture contains only black pixels).

If you want to go to a REALLY large size board (you could get, maybe, a dense 60K x 60K grid with a Tesla), you’ll need to store the board sparsely instead of as a dense array. A large portion of the board is probably empty at any given time so dense storage will waste a lot of space… Something similar to a sparse matrix representation. Maybe even a quad tree - that might allow you to save space/compute time by finding repeated 2D patterns.

This would also automatically make it much easier to optimize so that only cells which can possibly change are updated.

Yeah, that’s the problem. With only pixel shader 1.1, it’s easy to know when to add a new tile (just check the edge cells of existing tiles). However, it’s hard to know when to remove an empty tile. Basically it’s only possible to do so with CPU. That means the tile has to be read back and run through to check whether it’s fully black or not. That’s not going to be fast. Of course, you can ignore this and just let the number of tiles grow, but that not an ideal solution as many game of life patterns involves big moving objects, so it will leave behind huge number of empty tiles.

With CUDA I think it’s probably possible to do the entire sparse tile operation in the GPU.

I have written a little Game of Life program today with Cuda support. maybe u want to Compile and try. You need to link with all that CUDA stuff, and open gl. It’s written for windows, but maybe its portable, just try.

For explanation: SCREENX and SCREENY are defined as the dimensions of the Board, not the Screen (ok…of the screen, too), and POPULATION defines the density of the starting population, which is generated once at the beginning. If the Population consists only of static stuff and blinkers, u can restart it with ‘n’.

The age of a individual defines its color…new individuals appear green, and slightly change to red when getting older.

little impression:

External Media

The source code should be attached to my post…i hope…
GoL.cu (5.92 KB)

That builds and runs perfectly without modification on Linux.

Thats an indicator for that my teacher has done a good job. Feel free to modify my code in any way (e.g. what’s happening at the borders?)…
i’ve never heard of that algorithm, and read it in this thread :D, so maybe i’ve not thaught about everything clearly.

someone wrote, that the best starting density is at 0.3125…

€: just found an fault…starting population was wrong → fixed

also build no prob on win, attached is the code with a couple of tiny mods and a win exec.

press:

n - regenerate (from original)
esc - exit (from original)
g - gpu cpu toogle
space - pause on/off
. - single step

gol.zip (41.5 KB)

hehe, yes…there was still a cpu version in my code…but the performance is pretty poor on higher resolutions.

better try to make some kind of scaling, like the threadstarter wanted to…60000*60000, gogo! :D

dont forget to check free gpu mem ;O

even on the gpu the performance is preaty bad, no use of shared memory (really requires it). no use of vbo render and no use of gl interop … with these i expect about a x10 speedup

Stop complaining, start hacking. Go for it! ;)

i wanted to compile it on my linux ubuntu 9.10

and got this error.
can any one provide assistance ?


bibrak@bibrak-laptop:/media/Academics/Academic/Research/HPC/CUDA/Gmae of Life$ nvcc -I/home/bibrak/NVIDIA_GPU_Computing_SDK/C/common/inc -I/usr/local/cuda/include -L/home/bibrak/NVIDIA_GPU_Computing_SDK/C/lib -lcutil GoL.cu
In file included from GoL.cu:4:
/home/bibrak/NVIDIA_GPU_Computing_SDK/C/common/inc/GL/glut.h:64:1: warning: “APIENTRY” redefined
In file included from /home/bibrak/NVIDIA_GPU_Computing_SDK/C/common/inc/GL/glut.h:59,
from GoL.cu:4:
/usr/include/GL/gl.h:111:1: warning: this is the location of the previous definition
/tmp/tmpxft_000038ff_00000000-11_GoL.o: In function computeFPS()': tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xda03): undefined reference to glutSetWindowTitle’
/tmp/tmpxft_000038ff_00000000-11_GoL.o: In function display()': tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdb89): undefined reference to glMatrixMode’
tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdb8e): undefined reference to glLoadIdentity' tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdbb2): undefined reference to gluOrtho2D’
tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdbbe): undefined reference to glClear' tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdbca): undefined reference to glMatrixMode’
tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdbcf): undefined reference to glLoadIdentity' tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdbdb): undefined reference to glEnable’
tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdbef): undefined reference to glBlendFunc' tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdc0e): undefined reference to glColor3f’
tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdc1b): undefined reference to glPointSize' tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdc27): undefined reference to glBegin’
tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdc97): undefined reference to glColor3f' tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdcb4): undefined reference to glVertex3f’
tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdcdb): undefined reference to glEnd' tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdce0): undefined reference to glutSwapBuffers’
tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdce5): undefined reference to glutPostRedisplay' tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdcea): undefined reference to glFlush’
/tmp/tmpxft_000038ff_00000000-11_GoL.o: In function main': tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdf92): undefined reference to glutInit’
tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdf9e): undefined reference to glutInitDisplayMode' tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdfb2): undefined reference to glutInitWindowSize’
tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdfbe): undefined reference to glutCreateWindow' tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdfca): undefined reference to glutDisplayFunc’
tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdfd6): undefined reference to glutKeyboardFunc' tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xdffe): undefined reference to glClearColor’
tmpxft_000038ff_00000000-10_GoL.ii:(.text+0xe066): undefined reference to `glutMainLoop’
collect2: ld returned 1 exit status
bibrak@bibrak-laptop:/media/Academics/Academic/Research/HPC/CUDA/Gmae of Life$

nil

Thanks for sharing the code, u really did a great job in just a day.

BTW, looking at the code, i do not see any coalesced memory accesses and the code is highly divergent that may be the primary reason of not giving promising results as for as performance is concern.

to overcome these issues i am in a process of developing such data structures which may increase coalesced memory accesses.

here is my idea, if i am wrong any where please point me.

struct UNIT {

bool point[16];

} ;

struct GRID32 {

UNIT grid32 [2];
} ;

here “GRID32” is the basic building block, for the grid ( or the screen ),

here coalescing is achieved for any
pattern of accesses that fits into a segment size of 32 bytes for 8-bit words, (8-bit words = bools, if i am right)

this is the basic unit, here i am not allocating points (cells) linearly, where the threads with corresponding Index will access them.

furthermore if SCREEN size is to be 1000 X 1000,

then – ?

ANS …
int size = sizeof(UNIT) * 32 // = 32*32 = 1000 bytes = 1000 cells
malloc(…,size)


What remians …

if i am going in the right direction,

now a mapping mechanism is to be made, which decides which thread calculates which particular cell, and its neighbours.

it may be like … the image i am uploading
External Media

Hrm: i wrote that prog in 2 hours…

There are surely many ways to make it faster. I was just too lazy, maybe it’s really easy :)

how many constant memory is available? 256 byte? its surely too less, to store something efficient (there are no memory sections, which are used for every point in a block…too bad ;( )

but maybe coalescing is a great idea…just define a structure of 1616 points (integer values! …try with unsigned char…(coloring issue)), and make the global nextgen kernel calculating a whole block at once…so u got some control over the write acces. Be aware, that the usage of more registers/threat could also slow it down… also you should manage it in some way, to use all multiprocessors…with 1616 blocksize, you should be at something around ~4000 blocks in 1000*1000 res…perfect amount for my 9800 gtx :D

maybe it would be a great idea, to make memcpy DtoD from the so called D_b to D_a…so u dont need to transfer the current generation from host to device (thats easy, and probably a speedup of ~2x)

D_b allways contains the current generation (after it has been calculated once), so just copy D_b to D_a instead of H_a to D_a…should be similar.

hmm, ok i am going for the optimisation …

BTW i found the library for openGL

its -lglut