A global sync is a barrier that stops all the blocks until every one of them has reached that point and finished the memory transactions issued before it. You can bypass the memory copy very easily: simply swap the buffers every time, so that the output of the last kernel becomes the input of the next run. If you look at my implementation, I did just that.
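To make the buffer swap concrete, here is a minimal host-side sketch in plain C++. It is not the implementation mentioned above; `step` and `run` are hypothetical names, and `step` is only a CPU stand-in for one kernel launch on a wrap-around grid. In a real CUDA program the two pointers would be device buffers, and since launches on the same stream run in order, the kernel boundary gives you the grid-wide sync without any copy in between.

```cpp
#include <utility>
#include <vector>

// Hypothetical stand-in for one kernel launch: reads `in`, writes `out`.
// Computes one Game of Life generation on a w x h grid with wrap-around edges.
void step(const unsigned char* in, unsigned char* out, int w, int h) {
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            int n = 0;  // number of live neighbours
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    if (dx != 0 || dy != 0)
                        n += in[((y + dy + h) % h) * w + (x + dx + w) % w];
            unsigned char alive = in[y * w + x];
            out[y * w + x] = (n == 3) || (alive && n == 2);
        }
    }
}

// Runs `generations` steps. No data is copied between steps:
// only the roles of the two buffers are exchanged.
std::vector<unsigned char> run(std::vector<unsigned char> grid,
                               int w, int h, int generations) {
    std::vector<unsigned char> scratch(grid.size());
    unsigned char* cur = grid.data();      // input of the next step
    unsigned char* next = scratch.data();  // output of the next step
    for (int g = 0; g < generations; ++g) {
        step(cur, next, w, h);
        std::swap(cur, next);  // last output becomes the next input
    }
    return std::vector<unsigned char>(cur, cur + w * h);
}
```

After the loop, `cur` always points at the most recent generation, whichever of the two buffers that happens to be.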
I’ve just started learning CUDA from the examples in this book and I wrote a simple GoL, but it is terribly slow. The code is similar to yours, but in fact it is slower than a GoL written in C# using WinForms. Do you have any idea why, or what I am doing wrong?
Thanks, in release mode it runs about 4 times faster, but it is still too slow. Do you know any other ways to optimize it? I want to get a smooth animation in HD. Where should I look for optimizations? Should I use constant memory? And what about blocks and threads: how do I combine them to get maximum efficiency?
In the book they used textures to fetch the data, which makes the code faster than using normal arrays. However, on the Fermi architecture it is faster to use shared memory, in a similar way to how it is used in finite difference problems or convolutions. I also suggest changing the live-or-die condition so that you do not use ifs all the time.
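As one possible way to drop the ifs from the rule, the next state can be written as a single bitwise expression. This is only a sketch: `next_state` and the `GOL_HD` macro are names made up for illustration, and the cell is assumed to be stored as 0 or 1. The macro lets the same function compile as plain C++ and, under nvcc, as a `__host__ __device__` function callable from a kernel.

```cpp
// GOL_HD is a hypothetical helper macro: it expands to the CUDA
// qualifiers under nvcc and to nothing in a plain C++ compiler.
#ifdef __CUDACC__
#define GOL_HD __host__ __device__
#else
#define GOL_HD
#endif

// Branch-free Game of Life rule.
// alive:      current cell state, assumed to be 0 or 1
// neighbours: number of live neighbours, 0..8
// A cell is alive in the next generation if it has exactly 3 live
// neighbours, or if it is alive now and has exactly 2.
GOL_HD inline unsigned char next_state(unsigned char alive, int neighbours) {
    return (neighbours == 3) | (alive & (neighbours == 2));
}
```

Each thread would then sum its eight neighbours and store `next_state(cell, sum)`. Whether this is measurably faster than the if version depends on the compiler and the hardware, so it is worth timing both.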