The Game of Life in CUDA

with my kernel I get

6.2 FPS on 9200M

32 FPS on GeForce GTX 285

with SCREEN size 1K x 1K

/usr/bin/ld: cannot find -lcutil
collect2: ld returned 1 exit status

btw this server has tesla C1060

that would be a good test for ur project (if i get it to run)

i am not sure about your SDK version.

I have version 2.3 SDK and according to it

cutil is available in
/home/bibrak/NVIDIA_GPU_Computing_SDK/C/lib/

bibrak@bibrak-laptop:~/NVIDIA_GPU_Computing_SDK/C/lib$ ls
libcutil.a libparamgl.a librendercheckgl.a

please try adding this path to LD_PATH…

… yeah the tesla will be useful.
major difference is that that tesla has 4 GB memory and my GTX 285 has 1 GB

NVIDIA CUDA Software Development Kit (CUDA SDK) one of the SDKs
Release Notes
Version 2.1 for Linux

there is a SDK installed in the /root directory by a sudo
i am just a user so i can only copy

in fact that’s all that root contains(see screenshot)
( i am not sure if i have to install anything from there , at first i tried the NVIDIA_CUDA_SDK_2.02 but there is also another .run file… and others too in the folder cuda… so its getting complicated)

a co-student told me that he managed to run his project without any installation of SDK or smthin (despite the infoz that professor gave me)

PS: there r two instersting folders NVIDIA_GPU_Computing_SDK and NVIDIA_CUDA_SDK
so i think that someone(a sudo ?) installed two versions of SDK

hmmm, I think we need help from some one else on this issue, I suggest lets create a new topic, and post the problem there, so people who have relevant experience can help us .

what do u say ?

BTW you can download and install the new version of CUDA sdk and toolkit there too

happy new year from Greece…

all matters we ll be discussed tomorrow or so :P

Make a version that runs on windows, and I will get you a bunch of beta testers… :)

Why not just try doing it yourself?

If you look at Talonman’s previous postings, you’ll see he is the prototype of a “power user” and the antithesis of a developer. ;)

I think my kernel should also work on windows.

so again i have to ask please look at it, and comment on, "Does it make COALESCED memory accesses ? "

OK, had some time to do some more optimizations and clean up, this version has repeat boarders meaning going out of one side you come back to the other.

on my quadro 5800 i get around 145 fps and 1000 gps (generations per second) this is when i render every 7th generation on the cpu i get 6 gps that is around a 200x speed up :)

a list of some of the optimizations:

render using pbo

use gl interop

pinned memory

double buffers

shared memory

multiple generations per render.

keyboard functions:

[n] - new board

[space] - toggle pause

  • - single step

    [g]- toggle gpu/ cpu

    [s] - toggle shared mem usage

    [+] - inc number of generations per render

    - decrease number of generations per render

    [esc] - exit

    I used this as an example in a lecture i gave in a course on gpgpu at uni…

    of course there are some more things that can be done, but i leave these to u guys…

    [codebox]

    #include <stdio.h>

    #include <stdlib.h>

    #include <string.h>

    #include <GL/glew.h>

    #include <cuda_runtime.h>

    #include <cutil_inline.h>

    #include <cutil_gl_inline.h>

    #include <cutil_gl_error.h>

    #include <cuda_gl_interop.h>

    #if defined(APPLE) || defined(MACOSX)

    #include <GLUT/glut.h>

    #else

    #include <GL/glut.h>

    #endif

    int* H_a;

    int* H_b;

    int* D_a;

    int* D_b;

    #ifdef DEVICE_EMULATION

    #define SCREENX 64

    #define SCREENY 64

    #else

    //#define SCREENX 512

    //#define SCREENY 512

    //#define SCREENX 768

    //#define SCREENY 768

    #define SCREENX 1024

    #define SCREENY 1024

    //#define SCREENX 2048

    //#define SCREENY 2048

    #endif

    #define XBLOCKSIZE 16

    #define YBLOCKSIZE 16

    #define XSMEM 18

    #define YSMEM 18

    //#define XBLOCKSIZE 32

    //#define YBLOCKSIZE 16

    //

    //#define XSMEM 34

    //#define YSMEM 18

    //OpenGL PBO and texture “names”

    GLuint gl_PBO, gl_Tex, gl_Shader;

    uchar4 *h_Src = 0;

    // Destination image on the GPU side

    uchar4 *d_dst = NULL;

    float POPULATION=0.03125; //Chance, that the Random Starting Population generator decides to create a new individual

    //float POPULATION=0.092125; //Chance, that the Random Starting Population generator decides to create a new individual

    bool g_pause = false;

    bool g_singleStep = false;

    bool g_gpu = true;

    bool g_smem = true;

    bool g_noise = false;

    bool g_ab = 0;

    unsigned int g_genCount = 0;

    int g_numIt = 7;

    int gpsCount = 0;

    int GetIndividual(int x, int y, int* Array)

    {

    return (Array[x+(SCREENX*y)]);
    

    }

    void SetIndividual(int x, int y, int val, int* Array)

    {

    Array[x+(SCREENX*y)]=val;
    

    }

    void SetIndividual(int x, int y, uchar4 val, uchar4* Array)

    {

    Array[x+(SCREENX*y)]=val;
    

    }

    int Neighbors(int x, int y, int* Array)

    {

    int i, k, anz=0;
    
    for (i=-1;i<=1;i++)
    
    	for (k=-1;k<=1;k++)
    
    	{
    
    		if (!((i==0)&&(k==0)))// && (x+i<SCREENX) && (y+k<SCREENY) && (x+i>0) && (y+k>0))
    
    		{
    
    			int nx = x+i;
    
    			int ny = y+k;
    
    			if (nx == SCREENX)
    
    				nx = 0;
    
    			else if (nx == -1)
    
    				nx = SCREENX-1;
    
    			if (ny == SCREENY)
    
    				ny = 0;
    
    			else if (ny == -1)
    
    				ny = SCREENY-1;
    
    			if (GetIndividual(nx, ny, Array)>0)
    
    				anz++;
    
    		}
    
    	}
    
    	return anz;
    

    }

    void SpawnPopulation(float frequenzy, int* Array)

    {

    int random, x,y;
    
    srand ( time(NULL) );  
    
    for (x=0;x<SCREENX;x++)
    
    	for (y=0;y<SCREENY;y++)
    
    	{
    
    		random=rand() % 100;
    
    		if ((float)random/100.>frequenzy)
    
    			SetIndividual(x,y,0, Array);
    
    		else 
    
    			SetIndividual(x,y,1, Array); 
    
    	}
    

    }

    void GenNoise(float frequenzy, int* Array)

    {

    int random, x,y;
    
    srand ( time(NULL) );  
    
    for (x=0;x<SCREENX;x++)
    
    	for (y=0;y<SCREENY;y++)
    
    	{
    
    		random=rand() % 100;
    
    		if ((float)random/100.<=frequenzy)
    
    			SetIndividual(x,y,1, Array);
    
    		//SetIndividual(x,y,x+SCREENX*y,Array);
    
    	}
    

    }

    void NextGeneration()

    {

    int x, y, n, a;
    
    for (int g=0; g<g_numIt; g++)
    
    {
    
    	if (!g_ab)
    
    	for (x=1;x<SCREENX;x++)
    
    	{
    
    		for (y=1;y<SCREENY;y++)
    
    		{
    
    			n=Neighbors(x,y,H_a);
    
    			a=GetIndividual(x,y,H_a);
    
    			uchar4 clr;
    
    			clr.x = 0;
    
    			clr.y = 0;
    
    			clr.z = 0;
    
    			clr.w = 0;
    
    			if (a>0)
    
    			{
    
    				if ((n>3) || (n<2))
    
    					SetIndividual(x,y,0, H_b);
    
    				else
    
    				{
    
    					SetIndividual(x,y,a==255?255:a+1, H_b);
    
    					clr.y = 255;
    
    				}
    
    			}
    
    			else if (GetIndividual(x,y,H_a)==0)
    
    			{
    
    				if (n==3)
    
    				{
    
    					SetIndividual(x,y,1, H_b);
    
    					clr.x = 255;
    
    				}
    
    				else
    
    					SetIndividual(x,y,0, H_b);  
    
    			}
    
    			SetIndividual(x,y,clr,h_Src);
    
    		}
    
    	}
    
    	else
    
    	for (x=1;x<SCREENX;x++)
    
    	{
    
    		for (y=1;y<SCREENY;y++)
    
    		{
    
    			n=Neighbors(x,y,H_b);
    
    			a=GetIndividual(x,y,H_b);
    
    			uchar4 clr;
    
    			clr.x = 0;
    
    			clr.y = 0;
    
    			clr.z = 0;
    
    			clr.w = 0;
    
    			if (a>0)
    
    			{
    
    				if ((n>3) || (n<2))
    
    					SetIndividual(x,y,0, H_a);
    
    				else
    
    				{
    
    					SetIndividual(x,y,a==255?255:a+1, H_a);
    
    					clr.y = 255;
    
    				}
    
    			}
    
    			else if (GetIndividual(x,y,H_b)==0)
    
    			{
    
    				if (n==3)
    
    				{
    
    					SetIndividual(x,y,1, H_a);
    
    					clr.x = 255;
    
    				}
    
    				else
    
    					SetIndividual(x,y,0, H_a);  
    
    			}
    
    			SetIndividual(x,y,clr,h_Src);
    
    		}
    
    	}
    
    	g_ab = !g_ab;
    
    	g_genCount++;
    
    	gpsCount++;
    
    }
    

    }

    device int Dev_GetIndividual(int x,int y,int* Array)

    {

    return (Array[x+(SCREENX*y)]);
    

    }

    device void Dev_SetIndividual(int x, int y, int val, int* Array)

    {

    Array[x+(SCREENX*y)]=val;
    

    }

    device void Dev_SetIndividual(int x, int y, uchar4 val, uchar4* Array)

    {

    Array[x+(SCREENX*y)]=val;
    

    }

    device int Dev_Neighbors_smem(int x, int y, int* Array)

    {

    int i, k, anz=0;
    
    for (i=-1;i<=1;i++)
    
    	for (k=-1;k<=1;k++)
    
    	{
    
    		if (!((i==0)&&(k==0)))
    
    		{
    
    			if (Array[x+i+(y+k)*XSMEM] >0)
    
    				anz++;
    
    		}
    
    	}
    
    	return anz;
    

    }

    device int Dev_Neighbors(int x, int y, int* Array)

    {

    int i, k, anz=0;
    
    for (i=-1;i<=1;i++)
    
    	for (k=-1;k<=1;k++)
    
    	{
    
    		if (!((i==0)&&(k==0)))// && (x+i<SCREENX) && (y+k<SCREENY) && (x+i>0) && (y+k>0))
    
    		{
    
    			int nx = x+i;
    
    			int ny = y+k;
    
    			if (nx == SCREENX)
    
    				nx = 0;
    
    			else if (nx == -1)
    
    				nx = SCREENX-1;
    
    			if (ny == SCREENY)
    
    				ny = 0;
    
    			else if (ny == -1)
    
    				ny = SCREENY-1;
    
    			if (Dev_GetIndividual(nx, ny, Array)>0)
    
    				anz++;
    
    		}
    
    	}
    
    	return anz;
    

    }

    global void NextGen_smem(int* D_a, int* D_b, uchar4 *dst)

    {

    int a, n;
    
    int y = blockDim.y * blockIdx.y + threadIdx.y;
    
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    
    __shared__ int sdata[XSMEM*YSMEM];
    
    
    
    if ((x<SCREENX)&&(y<SCREENY))
    
    {
    
    	sdata[XSMEM+1+threadIdx.x+XSMEM*threadIdx.y] = Dev_GetIndividual(x,y,D_a);
    
    }
    
    __shared__ unsigned int y_min_off,y_max_off,x_min_off,x_max_off;
    
    // load upper line
    
    if (threadIdx.y == 0)
    
    {
    
    	if (y>0)
    
    		y_min_off = y-1;
    
    	else
    
    		y_min_off = SCREENY-1;
    
    	sdata[threadIdx.x+1] = Dev_GetIndividual(x,y_min_off,D_a);		
    
    }
    
    // load lower line
    
    if (threadIdx.y ==(YBLOCKSIZE-1))
    
    {
    
    	if (y < SCREENY-1)
    
    		y_max_off = y+1;
    
    	else
    
    		y_max_off = 0;
    
    	sdata[(XSMEM*(YSMEM-1)+1)+threadIdx.x] = Dev_GetIndividual(x,y_max_off,D_a);		
    
    }
    
    // load left line
    
    if (threadIdx.x ==0)
    
    {
    
    	if (x>0)
    
    		x_min_off = x-1;
    
    	else
    
    		x_min_off = SCREENX-1;
    
    	sdata[threadIdx.y*XSMEM+XSMEM] = Dev_GetIndividual(x_min_off,y,D_a);		
    
    }
    
    // load rigth line
    
    if (threadIdx.x ==(XBLOCKSIZE-1))
    
    {
    
    	if (x < SCREENX -1)
    
    		x_max_off = x+1;
    
    	else
    
    		x_max_off = 0;
    
    	sdata[threadIdx.y*XSMEM+XSMEM+XSMEM-1] = Dev_GetIndividual(x_max_off,y,D_a);		
    
    }
    
    __syncthreads();
    
    if (threadIdx.x == 0 && threadIdx.y == 0)
    
    {
    
    	sdata[0] = Dev_GetIndividual(x_min_off,y_min_off,D_a);
    
    	sdata[XSMEM-1] = Dev_GetIndividual(x_max_off,y_min_off,D_a);
    
    	sdata[XSMEM*(YSMEM-1)] = Dev_GetIndividual(x_min_off,y_max_off,D_a);
    
    	sdata[XSMEM*YSMEM-1] = Dev_GetIndividual(x_max_off,y_max_off,D_a);
    
    }
    
    __syncthreads();
    
    if ((x<SCREENX)&&(y<SCREENY))
    
    {
    
    	float noiseF = 0.0f;
    
    	n=Dev_Neighbors_smem(threadIdx.x+1,threadIdx.y+1,sdata);
    
    	a=sdata[threadIdx.x+1+(threadIdx.y+1)*XSMEM];
    
    	uchar4 clr;
    
    	clr.x = 0;
    
    	clr.y = 0;
    
    	clr.z = 0;
    
    	clr.w = 0;
    
    	if (a>0)
    
    	{
    
    		if ((n>3) || (n<2))
    
    			Dev_SetIndividual(x,y,0, D_b);
    
    		else
    
    		{			
    
    			Dev_SetIndividual(x,y,a==255?255:a+1, D_b);
    
    			clr.x = 255;
    
    			//clr.z = 255;
    
    		}
    
    	}
    
    	else if (a==0)
    
    	{
    
    		if (n==3 || noiseF > 0.8)
    
    		{
    
    			Dev_SetIndividual(x,y,1, D_b);
    
    			clr.y = 255;
    
    		}
    
    		else
    
    			Dev_SetIndividual(x,y,0, D_b);  
    
    	}
    
    	Dev_SetIndividual(x,y,clr,dst);
    
    }
    

    }

    global void NextGen(int* D_a, int* D_b, uchar4 *dst)

    {

    int a, n;
    
    int y = blockDim.y * blockIdx.y + threadIdx.y;
    
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    
    if ((x<SCREENX)&&(y<SCREENY))
    
    {
    
    	n=Dev_Neighbors(x,y,D_a);
    
    	a=Dev_GetIndividual(x,y,D_a);
    
    	uchar4 clr;
    
    	clr.x = 0;
    
    	clr.y = 0;
    
    	clr.z = 0;
    
    	clr.w = 0;
    
    	if (a>0)
    
    	{
    
    		if ((n>3) || (n<2))
    
    			Dev_SetIndividual(x,y,0, D_b);
    
    		else
    
    		{			
    
    			Dev_SetIndividual(x,y,a==255?255:a+1, D_b);
    
    			clr.x = 255;
    
    			//clr.z = 255;
    
    		}
    
    	}
    
    	else if (a==0)
    
    	{
    
    		if (n==3)
    
    		{
    
    			Dev_SetIndividual(x,y,1, D_b);
    
    			clr.y = 255;
    
    		}
    
    		else
    
    			Dev_SetIndividual(x,y,0, D_b);  
    
    	}
    
    	Dev_SetIndividual(x,y,clr,dst);
    
    }
    

    }

    int iDivUp(int a, int b){

    return ((a % b) != 0) ? (a / b + 1) : (a / b);
    

    }

    void CUDA_NextGeneration()

    {

    int gridx=iDivUp(SCREENX,XBLOCKSIZE);
    
    int gridy=iDivUp(SCREENY,YBLOCKSIZE);
    
    int blockx=XBLOCKSIZE;
    
    int blocky=YBLOCKSIZE;
    
    for (int i=0; i<g_numIt; i++)
    
    {
    
    	if (!g_ab)
    
    	{
    
    		if (g_smem)
    
    			NextGen_smem<<<dim3(gridx,gridy), dim3(blockx,blocky)>>>(D_a, D_b,d_dst);
    
    		else
    
    			NextGen<<<dim3(gridx,gridy), dim3(blockx,blocky)>>>(D_a, D_b,d_dst);
    
    	}
    
    	else
    
    	{
    
    		if (g_smem)
    
    			NextGen_smem<<<dim3(gridx,gridy), dim3(blockx,blocky)>>>(D_b,D_a,d_dst);
    
    		else
    
    			NextGen<<<dim3(gridx,gridy), dim3(blockx,blocky)>>>(D_b, D_a,d_dst);
    
    	}
    
    	g_ab = !g_ab;
    
    	g_genCount++;
    
    	gpsCount++;
    
    }
    

    }

    int fpsCount = 0; // FPS count for averaging

    int fpsLimit = 5; // FPS limit for sampling

    int g_Index = 0;

    unsigned int frameCount = 0;

    unsigned int timer = 0;

    void computeFPS()

    {

    frameCount++;
    
    fpsCount++;
    
    if (fpsCount == fpsLimit) {
    
        char fps[256];
    
        float ifps = 1000.f / (cutGetTimerValue(timer))* fpsLimit;
    
    	float gps = 1000.f/ (cutGetTimerValue(timer))* gpsCount;
    
    	sprintf(fps, "The Game of Life: %3.1f fps %d generations %3.1f gen/sec gpu: %d smem: %d numIt: %d noise: %d", ifps, g_genCount,gps,g_gpu,g_smem,g_numIt,g_noise);  
    

    glutSetWindowTitle(fps);

        fpsCount = 0; 
    
    	gpsCount = 0;
    
    	fpsLimit = ifps/3+1;
    

    cutilCheckError(cutResetTimer(timer));

    }

    }

    #define BUFFER_DATA(i) ((char *)0 + i)

    void renderImage(bool bUseOpenGL)

    {

    if (bUseOpenGL) {
    
        cutilSafeCall(cudaGLMapBufferObject((void**)&d_dst, gl_PBO));
    
    }
    
    if (!g_pause || g_singleStep)
    
    {
    
    	if (g_gpu)
    
    		CUDA_NextGeneration();
    
    	else
    
    	{
    
    		NextGeneration();
    
    		cutilSafeCall(cudaMemcpy(d_dst, h_Src, SCREENX * SCREENY * sizeof(uchar4), cudaMemcpyHostToDevice));
    
    	}
    
    	g_singleStep = false;
    
    }
    
    if (bUseOpenGL) {
    
      cutilSafeCall(cudaGLUnmapBufferObject(gl_PBO));
    
    }
    

    }

    // OpenGL display function

    void display(void)

    {

    renderImage(true);
    
    // load texture from PBO
    
    glBindTexture(GL_TEXTURE_2D, gl_Tex);
    
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, SCREENX, SCREENY, GL_RGBA, GL_UNSIGNED_BYTE, BUFFER_DATA(0));
    

    // fragment program is required to display floating point texture

    glBindProgramARB(GL_FRAGMENT_PROGRAM_ARB, gl_Shader);
    
    glEnable(GL_FRAGMENT_PROGRAM_ARB);
    
    glDisable(GL_DEPTH_TEST);
    
    glBegin(GL_QUADS);
    
    glTexCoord2f(0.0f, 0.0f); glVertex2f(-1.0f, -1.0f);
    
    glTexCoord2f(1.0f, 0.0f); glVertex2f(1.0f, -1.0f);
    
    glTexCoord2f(1.0f, 1.0f); glVertex2f(1.0f, 1.0f);
    
    glTexCoord2f(0.0f, 1.0f); glVertex2f(-1.0f, 1.0f);
    
    glEnd();
    

    glBindTexture(GL_TEXTURE_2D, 0);

    glDisable(GL_FRAGMENT_PROGRAM_ARB);
    

    glutSwapBuffers();

    glutPostRedisplay();
    

    computeFPS();

    }

    void cleanUp()

    {

    cudaFreeHost(H_a);
    
    cudaFreeHost(H_b);
    
    cudaFree(D_a);
    
    cudaFree(D_b); 
    

    if (h_Src) {

        free(h_Src);
    
        h_Src = 0;
    
    }
    

    cutilCheckError(cutStopTimer(timer) );

    cutilCheckError(cutDeleteTimer(timer));
    

    cutilSafeCall(cudaGLUnregisterBufferObject(gl_PBO));

    glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
    
    glDeleteBuffers(1, &gl_PBO);
    
    glDeleteTextures(1, &gl_Tex);
    
    glDeleteProgramsARB(1, &gl_Shader);
    

    }

    void keyboard(unsigned char key, int x, int y)

    {

    if (key==27)
    
    {  
    
    	cleanUp();
    
    	exit(666);
    
    }
    
    
    
    if (key=='n')
    
    {   
    
    	SpawnPopulation(POPULATION, H_a);
    
    	cudaMemcpy(D_a, H_a, SCREENX*SCREENY*sizeof(int), cudaMemcpyHostToDevice);
    
    	g_ab = 0;
    
    	g_genCount = 0;
    
    }
    
    
    
    if (key==' ')
    
    {
    
    	g_pause = !g_pause;
    
    }
    
    
    
    if (key=='.')
    
    {
    
    	g_pause = true;
    
    	g_singleStep = true;
    
    }
    
    if (key=='g')
    
    {
    
    	g_gpu = !g_gpu;
    
    	if (g_gpu)
    
    	{
    
    		cudaMemcpy(D_a, H_a, SCREENX*SCREENY*sizeof(int), cudaMemcpyHostToDevice);
    
    		g_ab = 0;
    
    	}
    
    	else
    
    	{
    
    		if (!g_ab)
    
    			cudaMemcpy(H_a, D_b, SCREENX*SCREENY*sizeof(int), cudaMemcpyDeviceToHost);
    
    		else
    
    			cudaMemcpy(H_a, D_a, SCREENX*SCREENY*sizeof(int), cudaMemcpyDeviceToHost);
    
    		fpsLimit = 1;
    
    		fpsCount = 0;
    
    	}
    
    }
    
    
    
    if (key=='s')
    
    {
    
    	g_smem = !g_smem;
    
    }
    
    
    
    if (key=='+')
    
    {
    
    	if (g_numIt<100)
    
    		g_numIt++;	
    
    }
    
    
    
    if (key=='-')
    
    {
    
    	if (g_numIt>1)
    
    		g_numIt--;	
    
    }
    
    
    
    if (key=='r')
    
    {
    
    	g_noise = !g_noise;
    
    }
    
    
    
    display();
    

    }

    // gl_Shader for displaying floating-point texture

    static const char *shader_code =

    “!!ARBfp1.0\n”

    “TEX result.color, fragment.texcoord, texture[0], 2D; \n”

    “END”;

    GLuint compileASMShader(GLenum program_type, const char *code)

    {

    GLuint program_id;
    
    glGenProgramsARB(1, &program_id);
    
    glBindProgramARB(program_type, program_id);
    
    glProgramStringARB(program_type, GL_PROGRAM_FORMAT_ASCII_ARB, (GLsizei) strlen(code), (GLubyte *) code);
    

    GLint error_pos;

    glGetIntegerv(GL_PROGRAM_ERROR_POSITION_ARB, &error_pos);
    
    if (error_pos != -1) {
    
        const GLubyte *error_string;
    
        error_string = glGetString(GL_PROGRAM_ERROR_STRING_ARB);
    
        fprintf(stderr, "Program error at position: %d\n%s\n", (int)error_pos, error_string);
    
        return 0;
    
    }
    
    return program_id;
    

    }

    void initOpenGLBuffers(int argc, char **argv, int w, int h)

    {

    printf("Initializing GLUT...\n");
    
    glutInit(&argc, argv);
    
    glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE);
    
    glutInitWindowSize(SCREENX, SCREENY);
    
    glutInitWindowPosition(512 - SCREENX / 2, 384 - SCREENY / 2);
    
    glutCreateWindow("The Game of Life");
    
    glutDisplayFunc(display);
    
    glutKeyboardFunc(keyboard);
    
        printf("Loading extensions: %s\n", glewGetErrorString(glewInit()));
    
        if (!glewIsSupported( "GL_VERSION_1_5 GL_ARB_vertex_buffer_object GL_ARB_pixel_buffer_object" )) {
    
    	    fprintf(stderr, "Error: failed to get minimal extensions for demo\n");
    
    	    fprintf(stderr, "This sample requires:\n");
    
    	    fprintf(stderr, "  OpenGL version 1.5\n");
    
    	    fprintf(stderr, "  GL_ARB_vertex_buffer_object\n");
    
    	    fprintf(stderr, "  GL_ARB_pixel_buffer_object\n");
    
    	    exit(-1);
    
        }
    
    printf("OpenGL window created.\n");
    
    // delete old buffers
    
    if (h_Src) {
    
        free(h_Src);
    
        h_Src = 0;
    
    }
    

    if (gl_Tex) {

        glDeleteTextures(1, &gl_Tex);
    
        gl_Tex = 0;
    
    }
    
    if (gl_PBO) {
    
        cudaGLUnregisterBufferObject(gl_PBO);
    
        glDeleteBuffers(1, &gl_PBO);
    
        gl_PBO = 0;
    
    }
    

    // check for minimized window

    if ((w==0) && (h==0)) {
    
        return;
    
    }
    

    // allocate new buffers

    h_Src = (uchar4*)malloc(w * h * 4);
    

    printf(“Creating GL texture…\n”);

        glEnable(GL_TEXTURE_2D);
    
        glGenTextures(1, &gl_Tex);
    
        glBindTexture(GL_TEXTURE_2D, gl_Tex);
    
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP);
    
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP);
    
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, w, h, 0, GL_RGBA, GL_UNSIGNED_BYTE, h_Src);
    
    printf("Texture created.\n");
    

    printf(“Creating PBO…\n”);

        glGenBuffers(1, &gl_PBO);
    
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, gl_PBO);
    
        glBufferData(GL_PIXEL_UNPACK_BUFFER_ARB, w * h * 4, h_Src, GL_STREAM_COPY);
    
        //While a PBO is registered to CUDA, it can't be used 
    
        //as the destination for OpenGL drawing calls.
    
        //But in our particular case OpenGL is only used 
    
        //to display the content of the PBO, specified by CUDA kernels,
    
        //so we need to register/unregister it only once.
    
        cutilSafeCall( cudaGLRegisterBufferObject(gl_PBO) );
    
    printf("PBO created.\n");
    

    // load shader program

    gl_Shader = compileASMShader(GL_FRAGMENT_PROGRAM_ARB, shader_code);
    

    }

    int main(int argc, char **argv)

    {

    int idev, deviceCount;
    
    cudaDeviceProp deviceProp;
    
    char *device = NULL;
    
    if(cutGetCmdLineArgumentstr(argc, (const char**)argv, "device", &device))
    
    {
    
    	cudaGetDeviceCount(&deviceCount);
    
    	idev = atoi(device);
    
    	if(idev >= deviceCount || idev < 0)
    
    	{
    
    		fprintf(stderr, "Invalid device number %d, using default device 0.\n",
    
    			idev);
    
    		idev = 0;
    
    	}
    
    }
    
    else
    
    {
    
    	idev = 0;
    
    }
    
    cutilSafeCall(cudaSetDevice(idev));
    
    cudaGetDeviceProperties(&deviceProp, idev);
    
    initOpenGLBuffers(argc, argv,SCREENX,SCREENY);
    
    glutDisplayFunc(display);
    
    glutKeyboardFunc(keyboard);
    
    glClearColor(0, 0, 0, 1.0);
    
    cudaMallocHost((void**)&H_a,SCREENX*SCREENY*sizeof(int));
    
    cudaMallocHost((void**)&H_b,SCREENX*SCREENY*sizeof(int));
    
    cudaMalloc( (void**)&D_a, SCREENX*SCREENY*sizeof(int));
    
    cudaMalloc( (void**)&D_b, SCREENX*SCREENY*sizeof(int));
    
    SpawnPopulation(POPULATION, H_a);
    
    cutilCheckError( cutCreateTimer( &timer));
    
    cutilCheckError(cutStartTimer(timer));  
    
    glutMainLoop();
    
    cleanUp();
    

    }

    [/codebox]

    new small one, hope this works

    [attachment=15168:gol.zip]

  • An elegant way of saying he only asks for stuff, but programs nothing. :)

    True I’m afraid…

    But also he is unquestionably a Nivida fan, and fights the good fight to further the GPU Revolution at all cost.

    @erdooom

    on Windows
    i get corrupted archive message when trying to open it(your attachment)

    i think if u save it as .cu the problem would be solved

    thank u anywayz

    attached a new small file, this should work, for some reason the nvidia server dosen’t handle bigger files very well …

    Can it be that same GPUs neen window size not 1000x1000 insted 1024x1024 (or other size / div 16 ) ?

    If i run gol on my 8800 GTX i have same red points (stable) at the all sides borders.

    That red borders are not in CPU mode.

    If i give size 512 or 1024 i didnt have that borders , not in gpu mode not in cpu mode.

    System Mac OS X , NV 8800 GTX ( you see it must be an hackintosh :) - orig. Macs only have GT not GTX )

    EDIT: Now ith newer GOL v6 that red borders are gone.

    What means the normal points in red ?

    If i press key N (new game) the gpu and cpu startscreen looks very different.

    On gpu lost of red dots. In cpu mode most are green. Changes (GOL itself ) looks same.

    I get 77 FPS GPU (8800GTX) and round 0.7 FPS CPU (C2D 3 Ghz)

    Here is my version. Tested on WinXP sp3, 9600 GT
    features:
    Packed bit matrix algorithm
    up to 16 steps per kernel launch
    aligned memory access
    direct rendering to glBitmap using interoperability
    1000x1000 ->30 fps, 100 steps per draw

    sounds impresive can we see your code ?

    yes i think i switched the red and green points between cpu and gpu, it happend by mistake but i left is since it give a good indication if you are running on cpu or gpu. i changed a bit of the code for the launch size. But i still don’t know how good it is when the screen size isn’t a multiple of 16. so don’t be surprised if you get funky behavior or crashes. how many generations per second are you getting ? have u tried increasing the number of generations per draw ?

    ok, but it is rather uncombed :"> It is win32 VC 2005 express edition project, but I hope it is portable without much troubles.