N-Queen solver for CUDA

I wrote a N-Queen solver for CUDA. At first I thought it would be quite slow because it may have a lot of divergent branches, but after a few tweaks the result is quite satisfying (though it does have a lot of divergent branches…). This program runs on a 8800GT faster than a 3GHz Core 2 Quad, which IMHO is not bad considering 8800GT does not even have a branch predictor :)

If anyone’s interested, the source code and executable can be downloaded here:


For those who haven’t heard about the N-queen problem yet (I had to look it up): It’s about finding an arrangement of queens on a chess board such that neither queen could capture any other queen within one move.

Sorry for being brief >.<

This program tries to find the number of possible solutions of N-queen problem on a NxN board. That is, the number of all possible arrangements where N queens on a NxN board and no queen can capture any other queen. The standard 8-queen has 92 distinct solutions.

I use a simple backtracking algorithm to enumerate all possible solutions. An array is used to simulate a stack for depth first search. Basically, the CPU generates a lot of configurations for upper rows, and the GPU compute the remaining rows. This way, a lot of threads can run independently on the GPU.

Since local memory is quite slow and registers can’t be indexed, I use shared memory for the stack arrays. Therefore, the number of threads is quite limited, to 96 threads per block. However, increasing the number of blocks is still beneficial, probably because it reduces the overhead for calling the kernels. The kernel also performs a reduction to sum the numbers computed by a block, so the CPU can have much less numbers to sum.

Basically this is a hybrid solution combines both CPU and GPU to perform the computation, but most of the computation time are still spent in the GPU, according to profiling results.

Are you using Dancing Links for backtracking?

There are fancy ways to share state among multiple threads to save stack space… basically you enumerate, but for the bottom (n) levels of the backtrack tree, you let threads use copies of just those leaves, and all the threads share the same common “top level” backtracking choices. Saves lots of storage (meaning thread registers) for complex problems since the threads may only be dancing with only a very small part of the state locally.

The downside is the extra complexity needed to wait for all the threads to finish, and run backtracking on the HIGHER levels to prep for the next thread spawn.

Just brainstorming, this may work well in CUDA, each block is a backtrack state with the (n) leaves remaining to backtrack with the block’s threads, the threads all work on those (final, leaf) backtrack options, when the block is done, so is the common shared memory root state. Use kernel calls to keep iterating.

Edit: actually I should have read your post better… you’re pretty much doing this now with the CPU doing the root levels!

Thanks for the input :)

By the way, while writing this program I found a weird bug. When timing the function running on CPU, like this:

start = clock();
end = clock();

both start and end will be “optimized” to the same place, thus no timing is done. I’m not sure whether it’s the clock() being optimized (wrongly assumed to return the same value each tmie) or the order of execution is changed, but the later is more likely. Change clock() to other timing functions such as timeGetTime() does not help. Also, it does not happen in debug mode. It also does not happen when do_something() called some CUDA functions.

I found a workaround by doing some “global” things in do_something(). For example, a printf fixes this. I use a global variable and change its value in do_something() to avoid this bug.

The cuda solution of n-queens problems seems interesting to me, but the code posted is hard to understand. I gave up trying to understand even the single threaded cpu solver, because I am not familiar with the meaning of all the mask arrays and numerous variables with meaningless names. Some comments in the code would be helpful, unless the explanation of this particular implementation can be found somewhere on the net. In that case a link would do.
Are you sure that a few months from now you will understand your own code?

Yeah, sorry about that. Most of the codes in this program are not very pretty. The recursive version (which is not used) is the most simple one and I’ll explain it here.

Since we only want to count the number of solutions, we don’t need to store the positions of the queens, we only need to know what positions are “masked out” by the queens on the previous rows. Since each row has only one queen, a queen can have three different directions to mask out lower rows: directly downward, left downward, and right downward.

For example, consider three rows:

. . . . Q . . .

. . . L D R . .

. . L . D . R .

L is left downward, D is directly downward, R is right downward.

In the program, we use bit fields to store these positions. “mask” stores directly downward mask, “l_mask” stores left downward mask, “r_mask” stores right downward mask.

To find an available position to place a queen for this row, we mix all three masks to create a “non-available position” mask, like this

unsigned int m = mask | l_mask | r_mask;

Then we iterate all zero positions in m (i.e. all available positions a queen can be placed on this row). To find a zero bit fast, we use:

unsigned int index = (m+1) & ~m;

This is a little trick. It works like this:

Suppose that m = 10010011, We want to find the first zero (from right), so we compute m+1= 10010100. Compare the two values:

10010011 (m)

10010100 (m+1)

The left bits are still the same, and the right bits are inverted to zero. Only the bit on the position we want is inverted to one. So we “AND” it with m inverted:

01101100 (~m)

10010100 (m+1)

00000100 (m+1) & ~m

we got the first available position for a queen. Then we can “AND” this position into m, to find the next position, etc.

Now we have a good position for this row, we need to generate masks for the next row. “mask” is simple, just “OR” it with the position we found. For “l_mask”, we need to “OR” it with the position, and then shift to left by 1 to create the new l_mask for next row. “r_mask” is similar, but shift to right by 1. This corresponds to the code:

total += solve_nqueen_internal(mask | index, (l_mask | index) << 1, (r_mask | index) >> 1, t_mask);

When “mask” is completely filled (we use a “t_mask” to check that), we got a solution.

By the way, the declaration of the recursive version is wrong, it should be

long long solve_nqueen_internal(unsigned int mask, unsigned int l_mask, unsigned int r_mask, unsigned int t_mask)

The original first argument “int n” is redundant and should be removed.

Although it’s not in the original recursive version, the functions that actually run have an additional trick: because a solution for n-queen can be flipped horizontally to create another solution (an exception is when n is odd and the queen on the first row is in the middle), so we only need to enumerate half of the solutions, and multiply by 2 to save time.

This is probably a timer resolution issue - depending on the elapsed time between the start and the stop you may not get reliable results. Double check the timer resolution on your system.

The most reliable way to time CUDA calls on GPU is to use CUDA events (see, for example, simpleStreams or asyncAPI SDK samples). Event timer resolution is the GPU clock period. You do have to be aware of the fact that events will only time GPU portion of the calls - events are recorded on the GPU, not CPU.


Oh, the timing on GPU functions works well. The problem is on CPU functions (which do not call any cuda functions). It does not work even if the CPU function spends a lot of time (e.g. a few seconds). After storing something into a global variable, it can be timed correctly. So I think it’s an optimization problem, perhaps related to some optimization flags. It’s probably not related to nvcc, though.


I am new to CUDA, and I tried a different randomized approach which may generate same solution multiple times, but is useful for finding solutions quickly when the problem size if large. E.g, it finds solution for problem sizes as big as 10000 queens.

The idea is to try putting the queens randomly and make sure that they are not attacking any other. The main advantage is that this can be done in parallel. So I thought of trying out CUDA. However I was disappointed - on a 8800GT, CUDA code ran around 10x slower compared to the same program on a quad core (QX9650@3GHz) :thumbsdown:

here is the kernel:


int checkPos(int curTryPos, int row, bool *cols, bool *firstDiag, bool *secondDiag, int n)


    if(cols[curTryPos] == true) return -1;

    if(firstDiag[curTryPos-row+n-1] == true) return -1;

    if(secondDiag[curTryPos+row]==true) return -1;

    return 0;



void putQueen(int curTryPos, int row, bool *cols, bool *firstDiag, bool *secondDiag, int n) 


    cols[curTryPos] = true;

    firstDiag[curTryPos-row+n-1] = true;




static void nQueens(int *results, int n, int *randNos, bool *scratch_cols, bool *scratch_diag1, bool *scratch_diag2)


    int threadNumber = threadIdx.x + blockIdx.x*blockDim.x;

    int arrIndex = threadNumber*n;

   int *currentTrial = &results[arrIndex]; // the current solution

   bool *cols = &scratch_cols[arrIndex];

    bool *firstDiag = &scratch_diag1[arrIndex*2];

    bool *secondDiag = &scratch_diag2[arrIndex*2];

   for(int i=0;i<n;i++) cols[i] = false;

    for(int i=0;i<2*n-1;i++) {firstDiag[i] = false; secondDiag[i] = false;}

   // try putting the queens at random

    for(int i=0;i<n;i++) {

        int randPos = randNos[arrIndex+i]%n;

        int curTryPos = randPos;

        if(randPos%2) {

            // try next right first, in case of collision

            while(curTryPos <n && checkPos(curTryPos, i, cols, firstDiag, secondDiag, n)) {



            if(curTryPos == n) {

                // no luck on right side.. try left

                curTryPos = randPos-1;

                while(curTryPos >=0 && checkPos(curTryPos, i, cols, firstDiag, secondDiag, n)) {



                if(curTryPos == -1) {

                    currentTrial[0] = -1; // no soln found

                    return;   // bad luck, can't do anyting



        } else {

            // try next left first, in case of collision

            while(curTryPos >= 0 && checkPos(curTryPos, i, cols, firstDiag, secondDiag, n)) {



            if(curTryPos == -1) {

                // no luck on left side.. try right

                curTryPos = randPos+1;

                while(curTryPos < n && checkPos(curTryPos, i, cols, firstDiag, secondDiag, n)) {



                if(curTryPos == n) {

                    currentTrial[0] = -1; // no soln found

                    return;   // bad luck, can't do anyting




        putQueen(curTryPos, i, cols, firstDiag, secondDiag, n);

        currentTrial[i] = curTryPos;


    // success


The function nQueens is passed a set of random numbers and given some scratch space to use. The results are collected later by the CPU (if the first element is not -1) This is almost excatly same as the CPU implementation I use.

This is my first CUDA program, so I know nothing about optimizing it for the GPU.

Any suggestions improving this? I am sure I am doing something wrong as the threads are totally independent and I expected at least 10x gain :(

The main speed problem with your approach would be path divergence among the 32 threads per warp.

And you don’t seem to be using any shared memory, so everything you do may be bottlenecked by bandwidth to global memory.

Alternative approach: try working with 64 bit masks (e.g. in unsigned long long or uint64_t variables ) representing the 64 squares of a chess board. The per-square mask for contains “1” bits for all squares that a queen on this particular square is able to capture. If a given random square collides with the aggregate masks of all queens placed so far (resulting from a bitwise OR operation of masks), you cannot place a queen there and you have to keep looking.

Thanks for the suggestions. I think I will need to modify the logic quite a bit to make any use of GPU.

This is what I feared. I guess the exact same CPU code won’t be useful - will need to re-think the logic.

I tried using shared memory - but it is very small. My problem size is > 10,000 queens (in 10000x10000 board).

Thanks for the suggestion, but this is not feasible for the problem sizes I am trying to solve (here n ~= 10,000, so I can’t have n*n masks)

OK I missed the part about the giant problem size. ;) So indeed some other smart solution may be needed.

Any expert suggestions?

Just a quick observation: if you use bitfield instead of bool array, it should be able to fit 10000 queens in the shared memory (you need 10000x6 bits, which is smaller than 16KB).

I tried to download your file nqueen.zip but it is not longuer available, Could you put it again?


Please download it here :)

Thanks a lot.

Hi Ping-Che,

In your parallel backtracking implementation, you generate some levels of the tree on the CPU then send the results to the GPU to continue.

I wrote a new thread at http://forums.nvidia.com/index.php?showtopic=161924 where I explain how I want to implement the backtracking algorithm.
This ensures dynamical load balancing.

Do you know if is it possible to do that in CUDA?


The non-recursive version gives an incorrect result for n = 1.

Hi Ping Che:
As you may know, at NVIDIA we are still using NQueen as one of the internal benchmarks we track on OpenCL and CUDA.
However, I have been told that we are seeing stability issues with NQueen when running on a GTX980. The issue is highlighted below based on feedback from our internal Compute team (if you could please take a look at the issue and provide some updates/suggestions, it will be greatly appreciated):
Command analyzed :- ./nqueen_cl.exe -platform 0 -local -noatomics 16

  1. This app en-queue 8 kernels and each kernel preceded by two write buffers.
  2. After completion of kernel it read backs the results.

The issue is how it is waiting for completion of kernels. It looks like it takes event from en-queued kernel and run tight loop clGetEventInfo to check completion and this might be happening in different threads i.e. several tight loops. Application I ran made 15300+ calls to this function. Due to this tight loop CPU usage will increase as well as scheduling of all threads which include driver threads will be affected and can cause variance in performance characteristics of the app, which is precisely what is happening. App should be fixed to use better mechanism to check completion of kernel which should not involve running tight loops.