N queens on CUDA Kernel very slow

I’m trying to adapt a code made for CPUs for more speed on GPUs, but I don’t know why it’s too slow.

Here’s the code where I prepare to use the kernel:

int execute_cuda (int *posy, int posx)


	int *d_retorno;

	int *d_ok;

	int *resultado;


	cudaMalloc((void**)&d_retorno, sizeof(int));

	cudaMalloc((void**)&d_ok, sizeof(int));


	*d_ok = 0;

	position<<<1, DIM>>>(d_retorno, tab_aux, tabuleiro, posx, d_ok);


	cudaMemcpy(posy, d_retorno, sizeof(int), cudaMemcpyDeviceToHost);

	cudaMemcpy(resultado, d_ok, sizeof(int), cudaMemcpyDeviceToHost);





	return *resultado;


And here is the kernel:

__global__ void position(int *d_retorno, char tabela[DIM], char tabuleiro2 [2][(2*DIM)-1], int posx, int *d_ok)


   int i = threadIdx.x;

   if ((tabela[i] != 'O') && (tabuleiro2[0][i+posx] != 'O') && (tabuleiro2[1][(DIM-1)+i-posx] != 'O')){

	  *d_ok = 1;

	  *d_retorno = i;



“tabela”, “tabuleiro2”, “posx” and DIM are constant (in the point of view of the kernel), I can put them in constant cache, but I don’t think it can resolve the problem.

What can I do to make it faster?

One more thing: Is there an instruction that ends a kernel execution like “break” ends a “while”? I could use it in the code.

I would like to thank any answer.

system: Winxp 32-bit

CUDA SDK and toolkit: 2.0

GPU: 9600GT

CPU: E8400, Core2Duo 3GHz

Visual Basic 2005

Have you played with altering the thread block size? 1 is tiny.

#define DIM 1024


position<<<1024, 1>>>(d_retorno, tab_aux, tabuleiro, posx, d_ok);

faster than

#define DIM 1024


position<<<512, 2>>>(d_retorno, tab_aux, tabuleiro, posx, d_ok);

Thanks, It’s faster now, but not too much, I think that I need the instruction to stop kernel, it will be useful to stop the execution when the kernel finds the position to place the queen.

Here’s an idea, and I have no idea about how this will perform: Create a shared array of integers with length = total number of threads, initialize all to 0. When a thread has found the position to place the queen, it sets the value for its entry in the array to 1. Each thread checks the contents of the array to see if any thread has a 1 in its place, and if it finds a 1, it doesn’t do the rest of the work in the kernel. Does this make sense?

One thing you might try is to get rid of the double-indirection array tabuleiro2 and use arithmetic to get the correct index. But it seems this is not a very good GPU problem, as you’re not doing any arithmetic. You must be terribly memory bound. Have you checked for uncoalesced memory accesses?

Thanks for the answer, but I think that it will not work. What is necessary is an instruction that stops the kernel and returns the control to the CPU to let the program continue.

There have been some other people who implemented n-queens in CUDA before, perhaps you can check out their posts for more info:



Thanks for the links, but I don’t understand:

What is a “Mio”? Is this 25.000/s and 15.000/s?

This is not a linear problem, 25.000 queens in a second does not mean that it can end 50.000 queen in two seconds.

My times:

CPU (Core 2 Duo, 3Ghz)

120.000 queens in 105 seconds

5.000 queens in less than 1 second

GPU (9600GT)

120.000 queens in more than 180 seconds

5.000 queens in 15 seconds

That’s the problem!

Thanks for the answers, the code was already right, the problem was a simple configuration at the Visual Studio!!
The hints about the block size and the shared array helped a lot.

Now I obtained:
Core 2 Duo 3GHz (OBS: It’s not optimized for 2 cores)
120.000 queens: 105 seconds

120.000 queens: 38 seconds

Thanks to everyone!!!
Now I’m gonna make it faster!

what was the configuration in VS that helped??

Did you use the CUDA VS Wizard that allows you to create a CUDA project in VS??