The Game of Life in CUDA

I will get back to you on what .dll files I needed to make it run on Windows 7. :)

I try to help when I can!

Update: I copied 4 dll files in, but still could not make it happy.

I don’t think it like my X64 dll’s.

I can’t seem to find the x86 versions.

u said u run 16 generations per kernel, how do u do that with the boarders ? you would need a global sync for that to work.


I don’t know how to cure it. I assumed, you have not XP emulation mode in your win7. May be, somebody else have suggestions?

Kernel gets 64x64 block on input and produces valid 32x32 block.

thats a cool idea ! but it seems u have some bug, if u change the number of iterations per step after the board is already stable then things start firing up again which shouldn’t happen.

we should combine both implementations :) could get some crazy numbers

Thanks. The main idea is to keep whole row of cells in the unsigned int64. But it implies some pesky bit fields manipulations and register shortage.

I wonder, if more graceful “4 bit per point” algorithm can work better:


ull device devCombine( ull x )


return x + (x >> 4) + (x >> 8);


ull device devCombineM( ull x )


return x + (x >> 8);


const ull ComparingMask = 0xCCCCCCCCCCCCCCCC;

const ull FixingMask = 0x1111111111111111;

device ull devStepRow( ull xm, ull x0, ull xp )


ull res = devCombine(xm) + devCombineM(x0) + devCombine(xp);

res = res | (x0 >> 4);

res = res ^ ComparingMask;

res = res & (res>>1);

res = res & (res>>2);

res = res & FixingMask;

return res;


But it make 4 times less work, and sparse data is bad for memory (packing is possible).

If it is on the boards boundary, then I know about it. I’ll fix it sometime. If not, that is really bad, can you reproduce it?

Well, use my code if you wish, but I suspect, it would be easier to reimplement algorithm than digging up my sources.


I’m trying to understand the last code that you posted.

So if it is not too hard for you, give us some additional information about it… (in general)

Such as(to be more specific)

1)What is the exact purpose of using the

void GenNoise(float frequenzy, int* Array)


As far as i know the traditional Game of life aglorithm does not include anything about noise…

or i just dont get it wright… plz explain the use of noise

  1. you also have
void NextGeneration()

__global__ void NextGen_smem(int* D_a, int* D_b, uchar4 *dst)

__global__ void NextGen(int* D_a, int* D_b, uchar4 *dst)

If u want to implement shared memory isn’t just enough to use the first definition?Why is the second needed?


#define XSMEM 18

#define YSMEM 18

What r u doing here? u define what? shared memory cannot use the existing blocksize?


#define SCREENX 64

#define SCREENY 64


#define SCREENX 1024

#define SCREENY 1024


I guess that the default board dims in ur project are 1024x1024, in which case that changes to 64x64 and so what is the

thing that DEVICE_EMULATION does here?

#if defined(__APPLE__) || defined(MACOSX)

#include <GLUT/glut.h>


#include <GL/glut.h>


On which OS u have ur project tested? i just dont get the above code…

Have u run it on Windows(whith no changes)?

  1. At last for now…
void initOpenGLBuffers(int argc, char **argv, int w, int h)

fprintf(stderr, "This sample requires:\n");

fprintf(stderr, "  OpenGL version 1.5\n");

fprintf(stderr, "  GL_ARB_vertex_buffer_object\n");

fprintf(stderr, "  GL_ARB_pixel_buffer_object\n");


In order to run this project on Windows or Linux 64bit do i have to download something more?except from the NVIDIA SDK?

I think that some files(.lib or .h) concerning glut are included there…

So are there any additional files needed for this project,in order to download them, and run it properly using OpenGL?

(or nvidia SDK 2.3 is just fine)

excuse me if the answers are too obvious but i am noob in CUDA

Thank you in advance.

For trivia, your Game of life install might be fine…

I have tried to install about 3 other programs now, that have trouble finding some .dll on my system, but not on others.

This is the latest:…2204&page=3

I don’t hold the answer as to my issue, but still wanted to report back there is a very good chance you already have all the required files for a Windows 7 guy. ;)

Sorry about that.

There is a msvcrt.dll in one of the screenshots. May be, you have no MS VC redistributable package installed. It can be downloaded here:…;displaylang=en

Thanks for the post, I will most likely install that. :)

Only 1 thing bothers me is I am 7 64bit Ultimate, and it has:
Supported Operating Systems: Windows 2000 Service Pack 3; Windows 98; Windows 98 Second Edition; Windows ME; Windows Server 2003; Windows XP Service Pack 2

I will look into insuring it’s OK for me too.


are you using cyclic boundary conditions in your code?

I think he is using

as evident from …> the statements which I have made bold

[codebox]int Neighbors(int x, int y, int* Array)


int i, k, anz=0;

for (i=-1;i<=1;i++)

	for (k=-1;k<=1;k++)


		if (!((i==0)&&(k==0)))// && (x+i<SCREENX) && (y+k<SCREENY) && (x+i>0) && (y+k>0))


			int nx = x+i;

			int ny = y+k;

			if (nx == SCREENX)

			<b>	nx = 0;</b>

			else if (nx == -1)

			<b>	nx = SCREENX-1;</b>

			if (ny == SCREENY)

			<b>	ny = 0;</b>

			else if (ny == -1)

			<b>	ny = SCREENY-1;</b>

			if (GetIndividual(nx, ny, Array)>0)




	return anz;


I haven’t check the performance of your program. But in my experience on GT200 based hardware the most naive implementation already performs nearly optimal. Most optimizations actually make it slower. There might be some possibilities for optimization when using bitmasks to represent live and dead cells which would highly reduce memory bandwidth usage, but as my cells required more than two states I didn’t test that.

Hi all,

I have implemented an other CELLULAR AUTOMATA model called “Fish and Shark” problem.

I am attaching the source.

you can also find it on [url=“Google Code Archive - Long-term storage for Google Code Project Hosting.”]Google Code Archive - Long-term storage for Google Code Project Hosting.

How to use

First press ‘n’ to spawn population
second press ‘Space Bar’ to start the Game (kernel)
how can press ‘.’ at any time to do step by step transitions

Remember to press ‘q’ when exiting

just to confirm yes i am


this is my contribution to the community and my first post. There is comparison between CPU and GPU execution. I was able to achieve up to 8 times more speed on GPU for matrix 256 x 128. This is tested on “GE Force 9600 GSO”

#include <stdio.h>

#include <stdlib.h>

#include <cuda_runtime.h>

#include <cutil.h>

#include <windows.h>

void gotoxy (int x, int y)


	COORD coord; // coordinates

	coord.X = x; coord.Y = y; // X and Y coordinates

	SetConsoleCursorPosition(GetStdHandle(STD_OUTPUT_HANDLE), coord);


void DrawCell(unsigned int *niz, int width, int height, char cellImage)


	int i, j;

	gotoxy(0, 0);

	for(i = 0; i < height; i ++)


		for(j = 0; j < width; j ++)


			printf ("%c", niz[i * width + j] == 1 ? cellImage : ' ');





__global__ void  StartKernel(unsigned int *deviceArray, unsigned int *deviceResultArray, int width, int height, int blockSize, int blockDimX, int blockDimY)


	int tx = threadIdx.x;

	int ty = threadIdx.y;

	int bx = blockIdx.x;

	int by = blockIdx.y;


	int i, j;

	int pos = by * width * blockSize +  ty * width + bx * blockSize + tx;

	int posY = pos / width;

	int posX = pos - posY * width;

	int max_array_count = width * height;

	int neighbors = 0;

	if(pos < max_array_count && (bx < blockDimX - 1 || (bx == blockDimX - 1 && posX < width % blockSize)) && (by < blockDimY - 1 || (by == blockDimY - 1 && posY < height % blockSize)))


		for (i = posX - 1; i <= posX + 1; i ++)


			for (j = posY - 1; j <= posY + 1; j ++)


				if(i == posX && j == posY) continue;


				if (i < width && i >=0 && j < height && j >=0 && deviceArray[j * width + i] == 1)






		if (neighbors == 3) deviceResultArray[pos]= 1;

		else if (neighbors == 2) deviceResultArray[pos] = deviceArray[pos];

		else if (neighbors == 0) deviceResultArray[pos]= 0;	

		else deviceResultArray[pos]= 0;	



int GetNumberOfNeighbors(unsigned int *niz, int x, int y, int width, int height)


	int neighbors = 0;

	int xPos;

	int yPos;




	MAX_WIDTH = width;

	MAX_HEIGHT = height;


	for (xPos= x - 1; xPos <= x + 1; xPos ++)


		for (yPos= y - 1; yPos <= y + 1; yPos ++)


			if(xPos == x && yPos == y) continue;


			if (xPos < MAX_WIDTH && xPos >=0 && yPos < MAX_HEIGHT && yPos >=0 && niz[yPos * width + xPos] == 1)






	return neighbors;


int GetNewValue(unsigned int *niz, int i, int j, int width, int height)


	int value = 0;

	value = GetNumberOfNeighbors(niz, i, j, width, height);

	if (value == 3) return 1;

	if (value == 2) return niz[j * width + i];

	return 0;


unsigned int *CreateNewState(unsigned int *niz, int width, int height)


	unsigned int *returnArray;

	returnArray = (unsigned int *) malloc(sizeof(int) * width * height);

	int i, j;

	for(i = 0; i < height; i ++)


		for(j = 0; j < width; j ++)


			returnArray [i * width + j] = GetNewValue(niz, i, j, width, height);



	return returnArray;


int main(int argc, char* argv[])


	unsigned int cpu_timer = 0;	

	int itterationCount = 1000;

	char cellImage = 'o';

	unsigned int *niz, *hostResultArray;

	unsigned int *deviceArray, *deviceResultArray;

	int MAX_WIDTH = 256;

	int MAX_HEIGHT = 128;

	int BLOCK_SIZE = 16;

	int i, j;

	int startX = MAX_WIDTH / 2;

	int startY = MAX_HEIGHT / 2;




	niz = (unsigned int *) malloc(sizeof(int) * MAX_WIDTH * MAX_HEIGHT);

	hostResultArray = (unsigned int *) malloc(sizeof(int) * MAX_WIDTH * MAX_HEIGHT);

	for(i = 0; i < MAX_HEIGHT; i ++)


		niz[i * MAX_WIDTH + startX] = 1;


	for(i = 0; i < MAX_WIDTH; i ++)


		niz[startY * MAX_WIDTH + i] = 1;



	for(i = 0; i <= itterationCount; i ++)


		niz = CreateNewState(niz, MAX_WIDTH, MAX_HEIGHT);						




	printf("Processing time of CPU was: %f (ms)\n", cutGetTimerValue(cpu_timer));



	/*CUDA Kernel*/

	unsigned int gpu_timer = 0;



	niz = (unsigned int *) malloc(sizeof(int) * MAX_WIDTH * MAX_HEIGHT);

	for(i = 0; i < MAX_HEIGHT; i ++)


		niz[i * MAX_WIDTH + startX] = 1;


	for(i = 0; i < MAX_WIDTH; i ++)


		niz[startY * MAX_WIDTH + i] = 1;


	cudaMalloc((void**)&deviceArray, MAX_WIDTH * MAX_HEIGHT * sizeof(int));

	cudaMalloc((void**)&deviceResultArray, MAX_WIDTH * MAX_HEIGHT * sizeof(int));

	cudaMemcpy(deviceArray, niz, MAX_WIDTH * MAX_HEIGHT * sizeof(int), cudaMemcpyHostToDevice);


	dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE); 

	int dimGridX = MAX_WIDTH / BLOCK_SIZE + (MAX_WIDTH % BLOCK_SIZE == 0 ? 0 : 1);

	int dimGridY = MAX_HEIGHT / BLOCK_SIZE + (MAX_HEIGHT % BLOCK_SIZE == 0 ? 0 : 1);

	dim3 dimGrid(dimGridX, dimGridY); 

	for(i = 0; i <= itterationCount; i ++)


		StartKernel<<<dimGrid, dimBlock>>>(deviceArray, deviceResultArray, MAX_WIDTH, MAX_HEIGHT, BLOCK_SIZE, dimGridX, dimGridY);	  

		cudaMemcpy(deviceArray, deviceResultArray, MAX_WIDTH * MAX_HEIGHT * sizeof(int), cudaMemcpyDeviceToDevice);



	cudaMemcpy(hostResultArray, deviceResultArray, MAX_WIDTH * MAX_HEIGHT * sizeof(int), cudaMemcpyDeviceToHost);	


	printf("Processing time of GPU was: %f (ms)\n", cutGetTimerValue(gpu_timer));





	CUT_EXIT(argc, argv);

	return 0;



my question is how to solve iteration problem with cellular automata?

is this a good practice:

fir(int counter = 0; coutner < 10000; counter ++)


	StartKernel<< execution configuration>>(deviceStart, deviceResult, ...)

  memcpy(deviceStart, deviceResult, sizeof..., cudaMemcpyDeviceToDevice);


or there is some other solution for this?

thank you for your help

Since you have no global synchronization you to call your kernel multiple times. So there is probably no other solution.

But why do you need a memcpy in your loop? This would slow down your computation considerably.


thank you for your reply.

what is global synchronization?

here is my answer for your question.

array “deviceStart” has initial values of matrix

after first iteration array “deviceResult” is GOF. I calculate “deviceResult” using “deviceStart”

when one iteration is over then “deviceStart” becomes “deviceResult” and I go again.

kind regards