2 GTX295 SLI Nqueens project

Hi all, as a capstone project I am running an N-queens calculator on 2 GTX295 cards in SLI. They are in a Rampage 2 Extreme system with 6 GB of XMS3 1600 and an i7 processor. I already have N-queens running on it successfully, but I think it is only running on 1 GPU out of 4. From what I can tell, the code defines the thread count as 96; can’t these cards run more threads than that? FYI, it ran the 20-queens case in 57 minutes. I will upload my code as soon as I can.

To fully utilize the GPU, you want to be running many thousands of threads to keep all the SMs busy. For example, 1000 blocks of 256 threads is pretty reasonable.
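
As a rough illustration of the sizing above, here is a minimal host-side sketch in plain C++ (no CUDA calls) of how one might choose the block count so that every work item gets its own thread. The names `LaunchConfig`, `make_launch_config` and `total_threads` are made up for this example, and the default of 256 threads per block is simply the figure suggested above, not a tuned value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: how many blocks to launch and how wide each is.
struct LaunchConfig {
    std::uint64_t blocks;
    std::uint64_t threads_per_block;
};

// Round up so that blocks * threads_per_block >= work_items.
LaunchConfig make_launch_config(std::uint64_t work_items,
                                std::uint64_t threads_per_block = 256) {
    assert(threads_per_block > 0);
    std::uint64_t blocks =
        (work_items + threads_per_block - 1) / threads_per_block;
    return LaunchConfig{blocks, threads_per_block};
}

// Total threads the configuration would launch. This can exceed work_items,
// so a real kernel would need a bounds check on its global thread index.
std::uint64_t total_threads(const LaunchConfig& cfg) {
    return cfg.blocks * cfg.threads_per_block;
}
```

With 256,000 independent sub-problems, `make_launch_config(256000)` gives 1000 blocks of 256 threads, compared with the 96 threads in total that the original code appears to use.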

Nqueens seems to be a trivial problem, but it is actually a very fun one, and it requires real design to work for higher N (N>20)!
In fact, from googling it looks like the solution count has only been computed up to N=23, so a CUDA search might be a fun way to find the higher N values.
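
For anyone who has not written one, here is a minimal CPU sketch of the usual bitmask backtracking counter, to show the kind of inner loop a GPU version would have to parallelize. It is not the original poster's code; the names `count_from` and `count_nqueens` are invented for this example, and it is only practical for small N.

```cpp
#include <cassert>
#include <cstdint>

// Count the ways to complete a partial placement, one row at a time.
// cols:  columns that already hold a queen
// diag1: squares in the current row attacked along one diagonal direction
// diag2: squares in the current row attacked along the other direction
// full:  mask with the low n bits set
std::uint64_t count_from(std::uint32_t cols, std::uint32_t diag1,
                         std::uint32_t diag2, std::uint32_t full) {
    if (cols == full) return 1;  // a queen in every column: one solution
    std::uint64_t total = 0;
    std::uint32_t free_squares = full & ~(cols | diag1 | diag2);
    while (free_squares) {
        // Isolate the lowest set bit, i.e. the next free square in this row.
        std::uint32_t bit = free_squares & (~free_squares + 1);
        free_squares ^= bit;
        total += count_from(cols | bit,
                            ((diag1 | bit) << 1) & full,
                            (diag2 | bit) >> 1,
                            full);
    }
    return total;
}

// Number of solutions for an n x n board (sketch, 1 <= n <= 31).
std::uint64_t count_nqueens(int n) {
    assert(n >= 1 && n <= 31);
    std::uint32_t full = (std::uint32_t{1} << n) - 1;
    return count_from(0, 0, 0, full);
}
```

Calling `count_nqueens(8)` returns the familiar 92 solutions. The recursion is the part that needs real design on a GPU, since each thread has to carry its own small stack of masks.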

Ah, there is also a nqueens@home project, and it’s solved up to N=25 (using 72 CPU-years of compute). It’d be fun to pass it!

Actually, N=25 has been achieved on 260 clustered computers using CPUs; it took six months.

Any progress getting all 4 devices to run?

What operating system and drivers?

Good luck!


First off, I don’t think that CUDA officially supports SLI configurations (unless something has changed recently), so you’ll need to “downgrade” to a non-SLI setup. Second, if you want to utilize all 4 processors, you’ll need to do some multi-GPU programming (CUDA doesn’t automatically scale to multiple GPUs); there is a multi-GPU example in the CUDA SDK, and some good examples scattered around on the forum here.

Also, I remembered that someone did an nqueens solver a while back, and found this for you: http://forums.nvidia.com/index.php?showtopic=76893

They’ve posted their code in that thread, so take a look, perhaps it will help. Again, if you want to speed it up more, you’ll need to adapt it to your system using the multi-GPU methods (and change any constants, etc. that are specific to the newer cards vs. the older card he used).
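
To make the multi-GPU point concrete, here is a minimal sketch in plain C++ of one way the search could be split across devices: each worker takes a share of the first-row queen positions and counts only the solutions below them. This is an assumed partitioning scheme, not the approach from the linked thread or the SDK sample. The "workers" are ordinary loop iterations standing in for devices; in a real program each would be a host thread bound to its own GPU (for example with `cudaSetDevice`). The names `complete_board`, `count_partitioned` and `sum_counts` are invented for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Bitmask backtracking: count the completions of a partial placement.
std::uint64_t complete_board(std::uint32_t cols, std::uint32_t d1,
                             std::uint32_t d2, std::uint32_t full) {
    if (cols == full) return 1;
    std::uint64_t total = 0;
    std::uint32_t free_squares = full & ~(cols | d1 | d2);
    while (free_squares) {
        std::uint32_t bit = free_squares & (~free_squares + 1);  // lowest free square
        free_squares ^= bit;
        total += complete_board(cols | bit, ((d1 | bit) << 1) & full,
                                (d2 | bit) >> 1, full);
    }
    return total;
}

// Assign first-row columns to workers round-robin and return the partial
// count each worker would produce. Workers here are simulated sequentially.
std::vector<std::uint64_t> count_partitioned(int n, int num_workers) {
    assert(n >= 1 && n <= 31 && num_workers >= 1);
    const std::uint32_t full = (std::uint32_t{1} << n) - 1;
    std::vector<std::uint64_t> partial(num_workers, 0);
    for (int col = 0; col < n; ++col) {
        int worker = col % num_workers;  // which (hypothetical) device owns this task
        std::uint32_t bit = std::uint32_t{1} << col;
        partial[worker] += complete_board(bit, (bit << 1) & full, bit >> 1, full);
    }
    return partial;
}

// The host adds up the per-worker results at the end.
std::uint64_t sum_counts(const std::vector<std::uint64_t>& counts) {
    std::uint64_t total = 0;
    for (std::uint64_t c : counts) total += c;
    return total;
}
```

For example, `sum_counts(count_partitioned(8, 4))` reproduces the 92 solutions of the 8-queens board from four independent partial counts. Splitting on the first row only yields N tasks, which is coarse; splitting on the first two or three rows would give finer tasks and probably better load balance across 4 GPUs.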

Quick question on CUDA with the 295. I’ve read around the web that you need to have DVI or some other output cables attached from the extra outputs to monitors to get the crunching working on the extra cores when you have SLI or a 295. (Apparently the cable can just go to your monitor’s 2nd DVI input; you don’t even need to display anything.) Apparently games and such will run off 1 output just fine, but CUDA won’t, from what I’ve read. Does anyone know if there’s any truth to this?

Only on Vista, I think. Maybe on XP. It is not necessary on Linux.

herrbifi – have you been able to determine if it is using all 4 devices?

To Nvidia CUDA Team – can you point us to any reports where someone is able to exploit all available devices in a multi GTX 295 setup? I’m keen to get a set of these, but I have not been able to find anyone who can say that their OS or drivers are able to use all GPU cores…


This thread details some experiments getting a quad GTX 295 up and running:


Aside from the extreme power requirements of 4 cards, it seems to work.

We’ve tested eight CUDA devices with the 180.xx drivers and it works.

(On Linux, at least, I don’t really run Windows that often!)

Thanks tmurray. Just to confirm:

Is this 4 x GTX 295 (i.e. not 2 x S1070)? When you say “it works”, does this mean “faster than 8 x GTX 260”, or “well, we see 8 devices…”

Hoping to see some results before buying a fleet of these :unsure:

I only really test with Tesla products. If GeForce works, that’s fine, but Tesla is what my group cares about.

Whether it’s faster depends entirely on how you’re measuring. If you’re basically doing a raw instruction throughput test, yes, it should be faster, but if you’re ever hitting PCIe you’re going to have a lot of bottlenecks there.

(also if you’re seriously thinking about buying a fleet of these and keeping them for any length of time, you should just spend the extra money and buy S1070s. I used to be the kind of guy that would happily build the craziest things ever in order to save money, but it would always break at the most inopportune times and fixing it was always a huge hassle. at this point in my life if I have to really monkey with it on a regular basis when I would otherwise be doing real work I don’t want it. this is why I have a Mac laptop instead of a Dell or Lenovo running Linux–could not deal with “oops wireless doesn’t work now that I’ve upgraded my kernel” or anything like that.)

Scaling will depend a lot on the particulars of your application. Before you blow a ton of money on this, you might want to try a single GTX 295 and see how your program handles two devices. Just in making that step you’ll discover many unexpected scaling bottlenecks in your code. :)

Our code is beautifully matched to CUDA. Time dependent DG (Discontinuous Galerkin) solver for PDE systems. The Host (main GUI thread) uses METIS to partition the problem for N devices, allocates all necessary data, then creates N workers. Each worker creates a context on its own device, loads its necessary subset from the Host data, then binds this to a set of textures on its device. From here it Just Runs. In its own time, each thread updates its own partition in the Host result arrays. After every Nstep tsteps, the Host tells the lads to chill, extracts the latest results, and gets Vtk (OpenGL) to do the SciViz.

So, during a simulation, there is no cudaMalloc/Free. Each device simply folds its arrays of floats over each other to calculate part of the PDE solution, with a WaitForMultipleObjects(N, …) to sync an inter-partition Send/Recv. My mission (which I chose to accept :) ) was to make this run on N devices without the overhead of shared memory MPI. Going from 1 to 2 x 9600GT gives 99% speed up. Easiest next step would be (2 x GTX 285) which would Just Work.
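
For comparing reports like this across setups, a small sketch of the usual speedup and parallel-efficiency arithmetic may help. It assumes that "99% speed up" means the 2-device run is about 1.99 times as fast as the 1-device run; that reading is an assumption, and the function names and timings are invented for this example.

```cpp
#include <cassert>
#include <cmath>

// Speedup: how many times faster the multi-device run is than the single one.
double speedup(double t_single, double t_multi) {
    assert(t_single > 0.0 && t_multi > 0.0);
    return t_single / t_multi;
}

// Parallel efficiency: speedup divided by device count (1.0 is perfect scaling).
double efficiency(double t_single, double t_multi, int devices) {
    assert(devices >= 1);
    return speedup(t_single, t_multi) / devices;
}
```

With made-up timings of 199 s on one device and 100 s on two, `speedup` gives 1.99 and `efficiency` gives 0.995. The same two numbers, measured on 2, 4 and 8 GPUs, are what would settle the (2 x GTX 295) versus (4 x GTX 260) question.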

But if a GTX 295 makes fully available 2 GTX 260’s per slot, well, we have to try. Only one problem: I have found no one able (or willing?) to report that (2 x GTX 295) >= (4 x GTX 260), even for cherry-picked (non-PCIe bound) tests. Does anyone have any news on this?

Why would that not be the case? If a single GPU of a GTX295 is as fast as or faster than a GTX260, then why would 2x GTX295 not be as fast as or faster than 4x GTX260?

I would not hold your breath for anyone to confirm because I don’t think a lot of people have been running 4xGTX260, and even fewer will have switched to 2xGTX295…

Good point! :D But there are those few pioneers whose progress we have been following: 3x and 4x GTX 295!? I was hoping by now someone would have reported great satisfaction and astounding results.

I have no doubt the hardware is ready, and on paper all looks good. The question is: which current drivers can exploit this compute capacity?

What do drivers have to do with this…?

No, the question is which programs/algorithms can exploit this compute capacity. And you just stated that your code scales wonderfully well with more GPUs, so it looks like your code can exploit this compute capacity. ;)

As I’ve reported in the thread mentioned above, four GTX295 cards run just fine (over days) given the right environment.

I can only compare to a single GTX280, which is what I have in my development machine. So, I will compare 1 GTX280 GPU to 1 GTX295 GPU. From what I can tell (I’m pretty new still to CUDA) the performance is as follows:

Early version of the kernel, heavy use of lmem (large arrays used in parallel bitslice), only a few bytes are transferred between host and device, several seconds running time of each kernel call (had some trouble with lock-ups)

GTX280 25 Mio results/s

GTX295 15 Mio results/s

Currently used, lightweight version of the kernel, no lmem used (parallel bitslice only used when there’s enough smem for it), only a few bytes are transferred between host and device, about one second running time of each kernel call

GTX280 16 Mio results/s

GTX295 15.5 Mio results/s

My own conclusion is that the raw computing power of one GTX295 GPU is almost on par with the GTX280. The more global memory is used within the kernel (and I’m sure that applies to host/device transfers, too), the bigger the difference becomes, due to the higher bandwidth and RAM clock speed of the GTX280.

However, since the figures above perfectly scale from one up to eight GTX295 GPUs, this is the way we’re doing it.

If you have any specific questions, feel free to ask.

Best regards,


Dear joar,

thank you very much for your update! Many of us have been following your adventure :)

It’s great that you are having such success. I guess you are running XP 64? There is a hot spot for these GTX 295’s among the folding@home folks. There are a few sagas about trying to get Vista 64 to run stable clients on all four cores with dual 295’s. But more luck with XP 64. This made me wonder about which platform to set up. Ideal would be HPC Server 2008.

I guess these are results / second? And that more is better? And that the 15 Mio is for only 1 of the cores?

If so, any idea why the GTX 295 is so low? (Or is 15 much better than 25?)

Is this saying the GTX 280 is now comparatively slower or faster than before?

As we would expect (and hope for!).

Just to confirm, on (XP64?) you have seen the expected performance, and from all 4x2 cores?

Have you heard of any similar success under Linux 64?

Again, thanks for the encouraging report!