2 GTX295 SLI Nqueens project

XP 64bit, that’s correct.

There are several hardware caveats as well. Basically I would recommend sticking to the FASTRA setup, just replacing the graphics cards with the GTX295s.

Yes, Yes, Yes, Yes. ;)

I’m guessing that the heavy use of lmem (local memory, which is just a per-thread private space in the device’s global memory) is what makes the difference so big.

The newer variation is optimized for the GTX295 and thus no longer uses any lmem (but, because of the limited smem, it also no longer uses parallel bitslice in all places). So the implementation itself is slower. But since it doesn’t use any lmem, it shows that the GTX280 and GTX295 are (as expected) very close in computing power but very different in memory performance.

The kernel performance per GPU is the same, no matter how many cores (one to eight tested). So, yes, the overall performance with 4x2 GPUs is 8 times the performance of just one GPU.

No, not yet. Haven’t tried it (and not intending to) myself, either.

You’re welcome. I’m happy to contribute.

Best regards,


Success! We have done it: we are running on Ubuntu 8.10 64-bit. We wrote a dispersal algorithm that takes the single computation and divides it among however many devices we have, in our case 4 GPUs (2x GTX295). Using pthreads and mutex locks, each GPU takes a portion of the load, and the results are then combined. We have cut computation time by a factor of approx. 3.5. We are now attempting to use multiple towers and are writing a socket program to divide the load across additional machines. We calculate that with the current setup we can complete 25 queens in approx. 58 days, which would beat the record of 6 months.
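A minimal sketch of the kind of pthreads-plus-mutex split described above, not the posters' actual code. Assumptions: a CPU bitmask solver (`solve`) stands in for the CUDA kernel that each thread would launch on its own GPU, the work is split by the column of the first-row queen, and all names (`work_t`, `worker`, `count_solutions_parallel`) are hypothetical.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical work descriptor: one per worker thread (one per GPU in a real setup). */
typedef struct {
    int n;                    /* board size */
    int first_col, last_col;  /* first-row columns [first_col, last_col) owned by this worker */
    uint64_t *total;          /* shared solution counter */
    pthread_mutex_t *lock;    /* protects *total */
} work_t;

/* CPU stand-in for the GPU kernel: counts completions of a partial placement.
   cols/d1/d2 are bitmasks of squares attacked in the next row. */
static uint64_t solve(uint32_t all, uint32_t cols, uint32_t d1, uint32_t d2)
{
    if (cols == all)
        return 1;                                      /* every row has a queen */
    uint64_t count = 0;
    uint32_t free_bits = all & ~(cols | d1 | d2);
    while (free_bits) {
        uint32_t bit = free_bits & (~free_bits + 1u);  /* lowest free square */
        free_bits ^= bit;
        count += solve(all, cols | bit, (d1 | bit) << 1, (d2 | bit) >> 1);
    }
    return count;
}

/* Each worker solves its own chunk, then adds its partial result under the mutex. */
static void *worker(void *arg)
{
    work_t *w = (work_t *)arg;
    uint32_t all = (1u << w->n) - 1u;
    uint64_t local = 0;
    for (int c = w->first_col; c < w->last_col; ++c) {
        uint32_t bit = 1u << c;
        local += solve(all, bit, bit << 1, bit >> 1);
    }
    pthread_mutex_lock(w->lock);
    *w->total += local;
    pthread_mutex_unlock(w->lock);
    return NULL;
}

/* Split the first-row columns across num_workers threads and combine the counts. */
uint64_t count_solutions_parallel(int n, int num_workers)
{
    enum { MAX_WORKERS = 16 };
    if (n < 1 || n > 31 || num_workers < 1)
        return 0;
    if (num_workers > MAX_WORKERS) num_workers = MAX_WORKERS;
    if (num_workers > n) num_workers = n;

    pthread_t threads[MAX_WORKERS];
    int created[MAX_WORKERS];
    work_t work[MAX_WORKERS];
    uint64_t total = 0;
    pthread_mutex_t lock;
    pthread_mutex_init(&lock, NULL);

    for (int i = 0; i < num_workers; ++i) {
        work[i].n = n;
        work[i].first_col = (i * n) / num_workers;
        work[i].last_col = ((i + 1) * n) / num_workers;
        work[i].total = &total;
        work[i].lock = &lock;
        created[i] = (pthread_create(&threads[i], NULL, worker, &work[i]) == 0);
        if (!created[i])
            worker(&work[i]);   /* thread creation failed: run this chunk in place */
    }
    for (int i = 0; i < num_workers; ++i)
        if (created[i])
            pthread_join(threads[i], NULL);

    pthread_mutex_destroy(&lock);
    return total;
}
```

With this sketch, `count_solutions_parallel(8, 4)` returns 92, the known solution count for 8 queens. In a real multi-GPU version each thread would first select its own device (for example with `cudaSetDevice`) before launching kernels. Splitting by first-row column gives chunks of unequal size, so finer-grained chunks would likely balance the load better.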

Oh, and we are also building a multi-GPU flop tester. We are seeing over a TFlop with 2 GTX295s.

OK, here’s the code we’re using for a quad-GPU nqueens solver. This program should scale to more GPUs with a bit of work. It basically partitions the board into chunks and gives each chunk to a different GPU. Sorry, Windows users, this program won’t work for you: it uses POSIX threads, one thread per CPU, each of which handles one GPU. Enjoy.


There are decent pthreads ports for the Win32 API.

Dear herrbifi, johnj21 and joar,
Thanks to you all for the great news. Time to boot up a board-full and see what happens :yes:

Just to let everyone know, we just finished running 22 queens in a time of 25 hours, 34 min, and 54 seconds. For our next trick, we’re going to do some clustering and have 6 GPUs on 2 computers run 23 queens. We might ask for a couple of volunteer systems to crunch some queens, as we’d like to do 25 queens in a reasonable amount of time. The way I’ve written my code, 25 queens can only be divided between 12 GPUs. We have 6, so it would be nice to have some 9800 GTX or higher cards that can donate some time.

If I understand you well, this means that it is impossible to drive the 2 GPUs of a device like an MSI GTX295 while it is running in SLI mode? I actually tried to execute DeviceQuery from the SDK with the one installed on my system, and only one GPU is detected. Is there an alternative to buying another card?

Turn off SLI mode. Then both GPUs on the card will be addressable via CUDA.

Yes! Great, thanks a lot. It works indeed :thumbup:

Now the PC won’t boot anymore as long as SLI is turned off, so I have to reinstall the drivers (I am using the 180.22 drivers). I can cope with that by turning it on and off manually at every computer start, but I am surprised I didn’t find any mention of it. Maybe I should stick to older drivers?


Can you describe the workstation you used to host the 2 GTX295s? Is it a self-made or a commercial workstation?


Yes, we are using an ASUS Rampage II Extreme board, 6 GB (3x2) Corsair XMS3, an i7-940 processor, a 150 GB Raptor, an Antec 900 case, a Thermaltake V1 LGA1366 cooler, an Nspire 850 modular P/S, and an LG Blu-ray burner. All running on Ubuntu 8.10.