Brute-forcing chess magic: why is this problem not well suited to GPU compute?

Why is this problem not well suited to GPU compute? Or is it, and I am doing something wrong?

So far I have been unable to beat my CPU performance when using CUDA to brute-force magic values for chess bitboards.

It is a pretty basic brute-force problem: each thread has its own curand seed and loops through the cases to check whether its random magic value works. I’ve done some optimization iterations using Nsight, but I am still about 2-3x slower than my CPU version.

> .\cudachess\x64\Release\cudachess.exe
34 multiprocessors detected, and 8 recommended blocks per multiprocessor
Starting loop with 272 BPG 128 TPB, 2000 max tries
That's 4268000 tries
Starting search 0 (101010101017e)
Full Search took 8.277000 sec (0.118868 per mtry)
Starting search 1 (202020202027c)
Full Search took 3.457000 sec (0.049647 per mtry)

> .\cudachess_bitboard_cpu\x64\Release\cudachess_bitboard_cpu.exe
Starting loop with 24 CPU threads, 3000000 max tries (0 padding bits)
That's 44aa200 tries
Starting search 0 (101010101017e)
Full Search took 3.735000 sec (0.051875 per mtry)
Starting search 1 (202020202027c)
Full Search took 1.891000 sec (0.026264 per mtry)

Nsight tells me I do not have good warp state statistics.

Warp Cycles Per Issued Instruction [cycle]	17.13
Avg. Active Threads Per Warp	14.66
Warp Cycles Per Executed Instruction [cycle]	17.13
Avg. Not Predicated Off Threads Per Warp	14.55

Most of the sampled stalls are stalled_wait.

It seems to me the logic is too sequential / dependent on early exits. Do you all think that is the case?

My kernel can be found here:

The core loop is:

    for (i = 1; i < MAX_TRIES; i++) {
        // build a 64-bit candidate magic from two 32-bit curand() draws
        magic = curand(&state);
        magic |= ((uint64_t)curand(&state)) << 32;

        // clear the collision bitmap (one bit per possible index)
        memset(used, 0, 1ull << (bitnpad - 3));

        // loop over all the cases we need to cover
        for (j = 0; j < jend; j++) {
            val = s_cases[j];
            val = magic * val;
            val = val >> shft;

            // if we fail due to a collision, exit early
            if (used[val >> 3] & (1 << (val & 0x7))) {
                magic = 0;
                break;
            }

            used[val >> 3] |= (1 << (val & 0x7));
        }

        // if it worked exit early
        if ((magic != 0) && (*out_magic == 0)) {
            // try to atomically set the value
            atomicCAS(out_magic, 0ll, magic);

            return;
        } else {
            // check if another thread solved it
            if (((i & (LOOPS_PER_CHECK - 1)) == 0) && (*out_magic != 0)) {
                return;
            }
        }
    }
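For context, the per-thread RNG setup is roughly like this (a simplified sketch; the real kernel has more parameters):

    #include <cstdint>
    #include <curand_kernel.h>

    __global__ void find_magic_kernel(uint64_t seed, uint64_t *out_magic /* , ... */)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        // one independent stream per thread: same seed, distinct subsequence
        curandState state;
        curand_init(seed, tid, 0, &state);

        // ... core loop from above ...
    }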

Thank you for your help. I’m new to writing CUDA, and want to understand better when to use it.

That is interesting. CUDA is probably not used much for chess (only for neural networks like Lc0). It would be interesting to see whether it has potential for more.

That is not totally bad: you would lose perhaps 50% of the performance to warp divergence (and possibly to memory accesses). The question is whether you are losing 100x somewhere else.

I would rather check whether your code uses local memory due to dynamic indexing, whether it has bank conflicts in shared memory, whether global memory accesses slow it down, and whether curand is fast enough. Also, perhaps you can do the atomicCAS more locally (e.g. in shared memory), as sketched below. Don’t you have to check its return value, too?
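For the last point, I mean something like this (untested sketch, assuming out_magic is an unsigned long long pointer; a few extra iterations due to the benign race on s_done are harmless):

    __shared__ int s_done;                        // block-local "someone finished" flag
    if (threadIdx.x == 0) s_done = 0;
    __syncthreads();

    for (i = 1; i < MAX_TRIES; i++) {
        // ... candidate test as in your core loop ...

        if (magic != 0) {
            atomicCAS(out_magic, 0ull, magic);    // publish the result globally, once
            atomicExch(&s_done, 1);               // and cheaply for the rest of the block
            return;
        }

        if ((i & (LOOPS_PER_CHECK - 1)) == 0) {
            if (s_done) return;                   // shared-memory read instead of global
            if (threadIdx.x == 0 && *out_magic != 0) atomicExch(&s_done, 1);
        }
    }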


I don’t need to check the atomicCAS return value because I don’t care which thread wins. It also is not in the bottleneck area. If there is already an answer there, that is fine; I just don’t want to clobber it halfway through a write.

Looking elsewhere in the profiler, the curand code does not seem to be part of the problem.

I do think the memory accesses are slowing things down, since used is a relatively large array in local memory (0x8000 bytes per thread, which cannot be shared).

How do I look for bank conflicts in shared memory?

Could you consider revamping the algorithm in the following way? It might be faster in the end.

Current GPUs have around 96 KiB to 128 KiB of shared memory per SM, depending on the model.

  • Store used three or four times in shared memory, giving 3x/4x parallelization per SM.
  • Use several threads to work on the i loop (MAX_TRIES) and/or the j loop (jend) in parallel, as far as possible, to give the threads something useful to do (see the sketch below).
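As a sketch (untested; COPIES and BITNPAD are placeholders for your own sizes, and whether the copies fit depends on the GPU):

    #include <cstdint>

    #define COPIES 3                            // 'used' copies per block; must fit shared memory
    #define BITNPAD 18                          // illustrative: 2^18 indices -> 32 KiB bitmap
    #define USED_BYTES (1u << (BITNPAD - 3))

    __global__ void find_magics(/* cases, out_magic, ... */)
    {
        extern __shared__ uint8_t s_used[];     // COPIES * USED_BYTES bytes, back to back

        // each group of threads shares one bitmap and tests one candidate magic
        int group = threadIdx.x / (blockDim.x / COPIES);
        uint8_t *used = &s_used[group * USED_BYTES];
        (void)used;                             // ... candidate test against 'used' as before ...
    }

    void launch(int blocks, int threads)
    {
        size_t smem = COPIES * USED_BYTES;
        // opt in to more than the default 48 KiB of dynamic shared memory per block
        cudaFuncSetAttribute(find_magics, cudaFuncAttributeMaxDynamicSharedMemorySize, (int)smem);
        find_magics<<<blocks, threads, smem>>>();
    }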

I am not sure whether that works better.

With your original approach, your local memory is cached in L1, and you can shift capacity between shared memory and L1. But if too many threads are active (more than 3 or 4 of those 32 KiB used arrays), the L1 cache is too small and everything goes through the L2 cache.
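The preferred split can be hinted per kernel (find_magics here is a placeholder for your kernel name):

    // bias the split toward L1 for the local-memory version of 'used' ...
    cudaFuncSetAttribute(find_magics, cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxL1);
    // ... or toward shared memory once 'used' lives there
    cudaFuncSetAttribute(find_magics, cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);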

That’s a good idea. I was worried that it wasn’t worth it to have threads for the j loop because jend will only be up to 4096, but some coordination there might be worth it.

I’ll also play around with moving s_cases into __constant__ memory, and used into shared memory.
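i.e., something along these lines (sketch; 4096 is the maximum case count mentioned above, and h_cases is the host-side table):

    #include <cstdint>

    // read-only and identical for every thread: a good fit for __constant__ memory
    __constant__ uint64_t c_cases[4096];

    void upload_cases(const uint64_t *h_cases, size_t jend)
    {
        // copy the case table into constant memory once, before launching the kernel
        cudaMemcpyToSymbol(c_cases, h_cases, jend * sizeof(uint64_t));
    }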

Thanks for your help.

Does the j loop have to be done sequentially, i.e. does each iteration need results from the previous iterations, e.g. within the used array? You write that the break is not needed.

I see that each i iteration starts with a freshly reset used array.

Perhaps you can even do without the used array and recompute the 4096 iterations every time? Or process groups of, say, 256 iterations in different threads.

Computations will be much faster than memory accesses.


I know nothing about computer chess other than having looked at bit board manipulation before. The Chess Programming Wiki has this to say:

There are four main ways to use a GPU for chess:

  • As an accelerator in Lc0: run a neural network for position evaluation on GPU
  • Offload the search in Zeta: run a parallel game tree search with move generation and position evaluation on GPU
  • As a hybrid in perft_gpu: expand the game tree to a certain degree on CPU and offload to GPU to compute the sub-tree
  • Neural network training, such as the Stockfish NNUE trainer in PyTorch or Lc0 TensorFlow training

If I understand correctly, your work here is similar to the “perft” module mentioned above?

How heavily is the CURAND PRNG being used, and does it figure into overall performance? Of the generators offered by CURAND, I would suggest trying Philox: as a counter-based PRNG it should have the lowest parallelization overhead, whether by leap-frogging or blocking. The default XORWOW PRNG should also be zippy, with the lowest basic computation cost but higher parallelization overhead than Philox.
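Switching is mostly a matter of the state type, e.g.:

    #include <cstdint>
    #include <curand_kernel.h>

    __global__ void kernel(unsigned long long seed /* , ... */)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        // counter-based Philox state: cheap per-thread subsequences
        curandStatePhilox4_32_10_t state;
        curand_init(seed, tid, 0, &state);

        // curand() works as before; curand4() returns four 32-bit values at once
        uint4 r = curand4(&state);
        uint64_t magic = ((uint64_t)r.y << 32) | r.x;
        (void)magic;
    }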

Is it correct to say that a high-level description of your approach is to find out, for a given mask, whether there exists a unique value in an array of 4096 values?

The problem I was solving was computing these:
https://www.chessprogramming.org/Magic_Bitboards

So yes, at a high level it is a brute-force search to find, for a given mask, a value that produces a unique index for each case that needs to be covered. The smaller you can make the lookup tables, the more space you save. I chose this mostly as a toy problem for playing with CUDA.
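For reference, lookup with a found magic is the standard multiply-shift hash from that page (matching the magic * val >> shft in my core loop):

    #include <stdint.h>

    // occ: relevant blocker bits for the square, magic: the searched-for constant,
    // shift: 64 - index_bits; a valid magic maps every occ to a distinct index
    static inline uint64_t magic_index(uint64_t occ, uint64_t magic, int shift)
    {
        return (occ * magic) >> shift;
    }

    // attacks[magic_index(occupancy & mask, magic, shift)] then gives the move set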

I made some changes and finally am faster than my CPU version! Moving my “cases” array into __constant__ memory freed up enough room to use most of __shared__ memory for my used bitarrays, which cut my execution time by more than half.

I am certain there is also a smarter way to find collisions than the bitarray, but I am happy to have it going faster now. A further improvement I might play with is a pipeline that uses another kernel to process the cases in parallel instead of an array for collision detection.


Hi,
congratulations on the speed-up!

If you want to try out some alternatives to the direct multiplications, consider

a) Using Tensor Cores for integer multiplications
b) Using Tensor Cores for 1-bit operations

Tensor Cores let you multiply two vectors, each drawn from a small set of values, computing all combinations (every choice of vector pair) very quickly.

So your s_cases could roughly form one set and your random values the other set.
The vector dimension would be needed to get more bits: the magic bitboard uses 64-bit multiplications, while Tensor Cores only support 8-bit integer multiplications.

The results are stored in a matrix, one entry per combination; multiplying two integer vectors yields their scalar product.

With 1-bit operations, each elementwise “multiplication” is an AND or XOR of the bits, and the results are accumulated by population count.
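For reference, the 8-bit integer path through the WMMA API looks roughly like this (a minimal one-tile sketch; whether the 64-bit magic multiplications can usefully be decomposed into such 8-bit products is the open question):

    #include <mma.h>
    using namespace nvcuda;

    // one 16x16x16 tile product: s8 x s8 -> s32, executed by a single warp (sm_72+)
    __global__ void int8_tile_mma(const signed char *A, const signed char *B, int *C)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, wmma::col_major> b;
        wmma::fragment<wmma::accumulator, 16, 16, 16, int> c;

        wmma::fill_fragment(c, 0);
        wmma::load_matrix_sync(a, A, 16);           // leading dimension 16
        wmma::load_matrix_sync(b, B, 16);
        wmma::mma_sync(c, a, b, c);                 // every row of A against every column of B
        wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
    }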