The Game of Life in CUDA

erdooom · March 10, 2010, 12:02pm

A global sync is a barrier that stops all the blocks until every one of them has reached that point and finished the memory transactions issued before it. You can bypass the memory copy very easily: simply switch the buffers every time, so that the result of the last kernel becomes the input for the next run. If you look at my implementation, I did just that.

Cheers

Eri

Milan.Dragovo · March 11, 2010, 8:37am

Hi erdooom,

Did you mean something like this?

for (int counter = 0; counter < 10000; counter++)
{
	// Alternate the roles of the two device buffers on every launch
	if (counter % 2 == 0)
		StartKernel<<<execution configuration>>>(deviceStart, deviceResult, ...);
	else
		StartKernel<<<execution configuration>>>(deviceResult, deviceStart, ...);
}

regards

milan

CapJo · March 11, 2010, 10:30am

This would be one possibility, but you can also simply swap the pointers to the arrays in your device memory after each kernel launch:

	// Swap the two device pointers so the last output becomes the next input
	tmp = deviceDistanceVolumePitch3D_1.ptr;
	deviceDistanceVolumePitch3D_1.ptr = deviceDistanceVolumePitch3D_2.ptr;
	deviceDistanceVolumePitch3D_2.ptr = tmp;
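For readers who want to try the buffer-switching idea end to end, here is a minimal sketch of it applied to the Game of Life. It is not anyone's implementation from this thread: the kernel `lifeStep`, the host helper `runLife`, the one-byte-per-cell layout, the wrap-around grid edges and the 16x16 block size are all assumptions made for illustration, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <utility>
#include <vector>

// One generation on a wrap-around grid (edge handling is an assumption).
// Reads only from `in` and writes only to `out`, so no copy is needed
// between generations.
__global__ void lifeStep(const unsigned char* in, unsigned char* out,
                         int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int neighbours = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;
            int nx = (x + dx + width) % width;
            int ny = (y + dy + height) % height;
            neighbours += in[ny * width + nx];
        }

    unsigned char alive = in[y * width + x];
    out[y * width + x] = (neighbours == 3) || (alive && neighbours == 2);
}

// Hypothetical host helper: returns `cells` advanced by `generations` steps.
std::vector<unsigned char> runLife(const std::vector<unsigned char>& cells,
                                   int width, int height, int generations)
{
    size_t bytes = cells.size();  // one byte per cell, width * height cells
    unsigned char *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, cells.data(), bytes, cudaMemcpyHostToDevice);

    dim3 threads(16, 16);
    dim3 blocks((width + threads.x - 1) / threads.x,
                (height + threads.y - 1) / threads.y);

    for (int g = 0; g < generations; ++g) {
        lifeStep<<<blocks, threads>>>(dIn, dOut, width, height);
        std::swap(dIn, dOut);  // the last output becomes the next input
    }

    // After the final swap, dIn points at the newest generation.
    std::vector<unsigned char> result(bytes);
    cudaMemcpy(result.data(), dIn, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Swapping the two pointers on the host costs nothing on the device, and consecutive launches on the same stream run in order, so each generation sees the complete output of the previous one.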

bytecar · November 15, 2011, 6:29pm

I wrote the Game of Life in CUDA and optimized it using cached 2D texture memory to exploit spatial locality. I thought I would share it here.

janisz · April 1, 2012, 7:09pm

Hi
I’ve just started learning CUDA by examples from this book and I wrote a simple GoL, but it is terribly slow. The code is similar to yours, but in fact it is slower than a GoL written in C# using WinForms. Do you have any idea why, or what I am doing wrong?
[attachment=25056:kernel.cu]

pasoleatis · April 1, 2012, 7:19pm

Your code does not seem to be optimized at all, but you should also check whether the compiler is producing debug information, since a debug build runs much slower than a release build.

janisz · April 1, 2012, 8:51pm

Thanks, in release mode it runs about 4 times faster, but it is still too slow. Do you know any other ways to optimize it? I want to get a smooth animation in HD. Where should I look for optimizations? Should I use constant memory? And what about blocks and threads: how should I combine them to get maximum efficiency?

pasoleatis · April 3, 2012, 11:42am

In the book they used textures to fetch the data. This makes the code faster than using normal arrays. However, on the Fermi architecture it is faster to use shared memory, in a similar way to how it is used in finite-difference problems or convolutions. I also suggest changing the live-or-dead condition so that you do not use ifs all the time.
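As a rough illustration of both suggestions, the sketch below stages each block's cells plus a one-cell halo in shared memory, the way stencil and convolution kernels usually do, and replaces the if-chain with a single comparison. The names `TILE`, `lifeStepShared` and `stepShared`, the one-byte-per-cell layout and the wrap-around edges are assumptions for illustration only. Whether this actually beats textures on a given GPU has to be measured.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 16;  // threads per block edge (assumed; tune per GPU)

__global__ void lifeStepShared(const unsigned char* in, unsigned char* out,
                               int width, int height)
{
    // The block's TILE x TILE cells plus a one-cell halo on every side.
    __shared__ unsigned char tile[TILE + 2][TILE + 2];

    int x0 = blockIdx.x * TILE - 1;  // global coordinates of tile[0][0]
    int y0 = blockIdx.y * TILE - 1;

    // Cooperative load: TILE*TILE threads fill all (TILE+2)^2 entries,
    // wrapping around the grid edges.
    int tid = threadIdx.y * TILE + threadIdx.x;
    for (int i = tid; i < (TILE + 2) * (TILE + 2); i += TILE * TILE) {
        int ty = i / (TILE + 2);
        int tx = i % (TILE + 2);
        int gx = (x0 + tx + width) % width;
        int gy = (y0 + ty + height) % height;
        tile[ty][tx] = in[gy * width + gx];
    }
    __syncthreads();

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x >= width || y >= height) return;  // safe: the barrier is behind us

    int lx = threadIdx.x + 1, ly = threadIdx.y + 1;
    int n = tile[ly - 1][lx - 1] + tile[ly - 1][lx] + tile[ly - 1][lx + 1]
          + tile[ly][lx - 1]                        + tile[ly][lx + 1]
          + tile[ly + 1][lx - 1] + tile[ly + 1][lx] + tile[ly + 1][lx + 1];
    unsigned char alive = tile[ly][lx];

    // Rule without ifs: a cell lives next step when n == 3, or when it is
    // alive and n == 2. Both cases are exactly (n | alive) == 3.
    out[y * width + x] = ((n | alive) == 3);
}

// Hypothetical host helper: returns `cells` advanced by one generation.
std::vector<unsigned char> stepShared(const std::vector<unsigned char>& cells,
                                      int width, int height)
{
    size_t bytes = cells.size();
    unsigned char *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, cells.data(), bytes, cudaMemcpyHostToDevice);

    dim3 threads(TILE, TILE);  // must match TILE used inside the kernel
    dim3 blocks((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    lifeStepShared<<<blocks, threads>>>(dIn, dOut, width, height);

    std::vector<unsigned char> result(bytes);
    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Each cell is read from global memory roughly once per block instead of up to nine times, and every thread in a warp executes the same instructions for the rule, so there is no divergence from the live-or-dead test.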

erdooom · April 3, 2012, 5:21pm

Hi,

Sorry to be so blunt, but … are you kidding? There are several very optimized versions of the code in this thread! Just read through it :)

pasoleatis · April 3, 2012, 5:35pm

It is a good exercise to start from scratch.