Hello all,

I have an algorithmic exercise: we’re given a cube of 4x4x4 cells, and we fill these cells with “0” or “1” in such a way that they give us the full set of binary numbers from “0000” to “1111” along each of the three axes. We want to find all such combinations. For example, here is one of the 41472 cubes (four 4x4 layers, top to bottom):

0000
0011
0101
1111

0111
0010
1101
0001

1100
1011
1001
1000

1110
0110
0100
1010

All the vertical “fours” make a full set, and all the horizontal “fours” as well; and if you take the “fours” along the z axis, you also get all 16 combinations.

An obvious and simple solution is to start filling cells from [0;0;0] and check the numbers along the axes after each fill, so we don’t blindly test all 2^64 variants but far fewer of them. This task is easy to parallelize: we can start some 2^N threads, each of which fills the first N cells with its own number in binary form (from 0 to 2^N - 1) and then continues filling the remaining cells with the common algorithm. As I wrote in the title, it’s essentially a walk over a binary tree.

I’ve attached the final program to this topic. The non-CUDA branch is the default; to compile for CUDA, define USING_CUDA in the compiler parameters.

In this program I fill the first M levels in one common calculating part (M is the “DEFAULT_PAR_LEVEL” definition; I chose 15 for my video card) and then pass the partially filled cube to the video card or to the emulator. The emulator runs 2^10 threads (the THREADORDER definition), while the video card runs 2^10 blocks of 2^10 threads each (the BLOCKORDER and THREADORDER definitions).

The problem is that I gain no profit from the video card. I got all the cubes with my emulator, which ran for about 21 hours (an ASUS P5KPL SE motherboard with an Intel Pentium E5300 and 2x1 GB of DDR2-1066 memory). Then I started the variant with CUDA and saw that it would take just about the same time, maybe a bit more (22-23 hours), on my MSI GeForce GTX 550 Ti with 192 cores. Level 15 was chosen because lower levels cause a timeout (kernels running longer than 10 seconds), while higher levels pass too fast. I have no idea how to make the __device__ code run faster, so I’m asking you to take a look and tell me where and what optimization is needed. Is it possible to make CUDA 10 times faster than the CPU? How do I use the resources effectively? Please.