# Optimizing a kind of binary tree: please advise on speeding up

Hello all,

I have an algorithmic exercise: we are given a cube of 4x4x4 cells, and we fill these cells with “0” or “1” in such a way that they give us the full set of binary numbers from “0000” to “1111” along each of the three axes. We want to find all such combinations. For example, here is one of the 41472 such cubes:

0000
0011
0101
1111

0111
0010
1101
0001

1100
1011
1001
1000

1110
0110
0100
1010

All vertical “fours” make a full set, all horizontal “fours” as well, and if you read the “fours” along the z axis, you get all 16 combinations too.

An obvious and simple solution is to start filling cells from [0;0;0] and check the numbers along the axes after each fill, so we do not blindly test all 2^64 variants, but far fewer. This task is easy to parallelize: we can start 2^N threads, each of which fills the first N cells with its own index in binary form (from 0 to 2^N - 1) and then continues filling the remaining cells with the common algorithm. It is, as I wrote in the title, a kind of binary tree.

I’ve attached the final program to this topic. The non-CUDA branch is the default; to compile for CUDA, define USING_CUDA in the compiler parameters.

In this program I fill the first M levels in one common calculating part (the number M is “DEFAULT_PAR_LEVEL”, chosen as 15 for my video card), and then pass the partially filled cube to the video card or to the emulator. The emulator runs 2^10 threads (the THREADORDER definition), while the video card runs 2^10 blocks of 2^10 threads each (the BLOCKORDER and THREADORDER definitions).

The problem is that I get no benefit from the video card. I obtained all the cubes with my emulator, which ran for about 21 hours (an ASUS P5KPL SE motherboard with an Intel Pentium E5300 and 2x1 GB of DDR2 1066 memory). I started the CUDA variant and saw that it would take about the same time, maybe a bit more (22-23 hours). I have an MSI GeForce GTX 550 Ti with 192 cores. Level 15 was chosen because lower levels cause a timeout (kernels longer than 10 seconds), while higher levels pass too fast. I have no idea how to optimize the “device” code to run faster. So I’m asking you to take a look and tell me where and what optimization is needed. Is it possible to make CUDA 10 times faster than the CPU? How do I use the resources effectively? Please.

Oh, and the code - see the attachment.
cube_main.zip (4.5 KB)

The first thought is to use shared memory. But it’s a scarce resource. I can reduce the size of each cell to 1 byte, but will that add speed? Say I make not 1024 threads per block, but 32 or 64, and they use bytes in shared memory. Will the speed increase?

I made some optimizations: I packed all the information about the cube into bits and used shared memory:

```
#define THREADORDER 7

__global__ void runParallelCubeSeekKernel(...) {
    ...
}
```

Of course, all the methods were rewritten with shifts and masks to access the needed bits. The result was about 7 times faster: the new version took 3.3 hours against 22 hours for the first version.

Can it be made even faster? Maybe I should use the CUDA driver API instead of the CUDA runtime?

You won’t get much of a performance benefit from switching to the Driver API. Rather, I would advise running your program through a profiler and seeing where the bottlenecks are and focusing on those sections of your code. Congratulations on getting the 7x speedup, that’s impressive!

I don’t quite understand what you’re trying to do. Maybe you could give a clearer statement of your problem?

What specific characteristics do the cubes you’re trying to find have?

I’ll try. We have a 4x4x4 cube: 64 cells, “small cubes”. Each cell contains “0” or “1”. Let’s choose one direction, say vertical, along the z axis. There are 16 columns, each consisting of 4 cells, and these columns are packed in a 4x4 block: 16 columns of 4 cells. The order of cells in a column is important: columns “0101” and “1001” are not equal. So each column represents a binary number of exactly 4 digits. Now see the nice match: there are exactly 16 4-digit binary numbers (0000, 0001, 0010, …, 1110, 1111) and exactly 16 columns. So we can make 16 unique columns and pack them 4x4; the result will be a 4x4x4 cube.

Now choose another direction, for example y. Here we also see 16 horizontal “columns” of 4 cells. Does each of them hold a unique 4-digit binary number? If not, I throw this cube away. I want a cube which has 16 unique 4-digit columns along z, along y, and also along x. How do I look for them? I just fill the cube cells and check this uniqueness whenever possible, i.e. whenever a new column is completed.

The beginning is simple: we start from one corner and fill columns along the x axis. We simply fill the 1st, 2nd, 3rd and 4th cells with “0” (the y coordinate is 0, z is 0), and now we have the first column along x. We mark that we have a certain 4-digit number made of these 4 cells and continue filling. The 5th, 6th, 7th and 8th cells get “0” (y is 1, z is 0); we have another column along x and check its uniqueness, and the check fails: we already have the “0000” column. So we fill the 8th cell with the next possible value, “1”, check uniqueness, and now the check passes. Then we fill the 9th, 10th, 11th and 12th cells (y = 2, z = 0) with “0” (you see, we check only full columns, not partially filled ones). The check fails, we fill the 12th cell with “1”, but the check fails again: “0001” is the 2nd column. We have tried all possible values for the 12th cell, so now we have to return to the 11th cell and put the next possible value there, “1”. No check is needed there, and we simply go on to filling the 12th cell and checking. As usual, we fill the 12th cell with “0” again and check, and the check passes at last, because the “0010” column is unique for now.

Filling the 13th, 14th, 15th and 16th cells (y = 3, z = 0) is more complex, because we check not only along x but also along y. The 13th cell finishes the first column along y (the 1st - 5th - 9th - 13th cells), and we have to mark this “0000” among the unique columns along y. Then we try to put “0” into the 14th cell, and the check fails, because we already have a “0000” column along y. So we put “1” in the 14th cell and go on to the 15th cell. It successfully takes “0” for now, and the 16th cell too; the 4th column along x is “0100”.

The further we go, the more constraints we meet, and we have to roll back to put “1” instead of “0”, or even back towards the beginning if we can’t fill the current cell with anything else.

What if we finally get a cube that meets our conditions? We write it to the result file, or, if we can’t reach the file at the moment, we put it into temporary storage and continue seeking.

You see, this task can be easily parallelized: just decide from which cell (“level” in the terminology of my program) we pass the partially filled cube to the kernel. We could start seeking from the very first cell, but that would take too much time on the video card and cause a timeout; that’s why I chose the 14th level to continue the search in parallel mode: on my GeForce GTX 550 Ti one kernel call takes 700 ms on average. When the parallel processing is finished, I write all the cubes found to the file, fill the 13th level with the next value (or lower levels if needed) and launch the kernel with the next variant.

Now I think I understand the problem, but I don’t have the time to fully digest your algorithm yet.

From what I have understood, you are trying to find all 4x4x4 cubes of binary cells that are distinguishable even after all possible rotations in 3-dimensional space. Am I correct?

Sorry for the somewhat late answer. Rotations? I didn’t think about rotations, but if you rotate the cube, it won’t lose its property: all 16 binary numbers along each axis.

Concerning optimization: I’ve read that the fewer conditions you have, the better, and that this is true about small loops too. Okay, I unrolled all the small loops. But I still have a question about conditions. I have 3 checks in one method which is called very often. Do I need to move all these checks to the end of the method? Or should they be as close to the beginning as possible? In other words, which of the 2 variants will be faster:

```
__device__ bool checkCounts(const cube* pvar, unsigned int candidat) {
    unsigned int columnValueX = calcValue(getColumnX(pvar));
    if (columnValueX != NOT_FULL_COLUMN) {
        if (!checkColumnX(pvar, columnValueX))
            return false;
    }

    unsigned int columnValueY = calcValue(getColumnY(pvar));
    if (columnValueY != NOT_FULL_COLUMN) {
        if (!checkColumnY(pvar, columnValueY))
            return false;
    }

    unsigned int columnValueZ = calcValue(getColumnZ(pvar));
    if (columnValueZ != NOT_FULL_COLUMN) {
        if (!checkColumnZ(pvar, columnValueZ))
            return false;
    }

    return true;
}
```
```
__device__ bool checkCounts(const cube* pvar, unsigned int candidat) {
    unsigned int columnValueX = calcValue(getColumnX(pvar));
    unsigned int columnValueY = calcValue(getColumnY(pvar));
    unsigned int columnValueZ = calcValue(getColumnZ(pvar));

    if (columnValueX != NOT_FULL_COLUMN) {
        if (!checkColumnX(pvar, columnValueX))
            return false;
    }

    if (columnValueY != NOT_FULL_COLUMN) {
        if (!checkColumnY(pvar, columnValueY))
            return false;
    }

    if (columnValueZ != NOT_FULL_COLUMN) {
        if (!checkColumnZ(pvar, columnValueZ))
            return false;
    }

    return true;
}
```

In ordinary CPU programming, I’d say that the 1st variant is definitely faster. But what is really faster in CUDA? Maybe there is yet another variant?

I would recommend just timing both approaches. An optimizing compiler may be able to transform the second variant into something very similar to the first, so there may be no practical difference. Then again, the necessary code motion may be inhibited by the access through pointers (although there don’t seem to be potential aliasing issues here). Instead of speculating, it is really best just to time each version. Of course that advice applies equally to programming on CPUs and GPUs.