Using more than 1 CUDA card at a time. Physics simulations flat out flying on GPU

Greetings,

I visited my advisor yesterday, and he's using CUDA cards with great success for his simulations. He's seeing a 200x speedup over the CPU once all the operations are kept on the GPU. His claim is that most of the standard benchmarks do far more I/O and transfer data much more often than his simulations, which just flat-out fly on the GPU.

There's one issue they are having, though: they can't get the cards to work together. Machine 1 is small and simple, with one GTX285 (2 GB). Machine 2 has a server motherboard (he claimed 4 full x16 PCI-e 2.0 paths) and 4 GTX285s (and a honking big case, etc.). The only way they have gotten top performance out of the four cards is to send a separate job to each card. Of course, this still subjects the work to more I/O, but if it's done cleverly, it would enable bigger simulations.

For those of you working with more than one card, how do you get them to talk to each other? And do you see slowdowns relative to using a single card?

Finally, I was advised that adding a very cheap ($20) video card for display output, freeing the GTX285 to do nothing but math, avoids the big performance hit of doing math and graphics on the same card. In my mind, this would make an x16/x16/x4 motherboard better for a 2-GPGPU system than an x16/x8/x8 solution, because you could put the cheap card in the x4 slot and keep full bandwidth to the two GPGPUs. Of course, your cheap card might be regular PCI, AGP, or whatever, but having two dedicated x16 ports makes good sense to me now, and seems to be the sweet spot in price/performance/setup effort.

I regret that I don’t have more specifics, but if anyone out there has approximate answers to these issues, I can get you in touch with the interested parties.

Regards,
Martin

I use multiple cards in Linux… mostly 2 GTX295s, meaning 4 GPUs.
The speedup across the 4 GPUs is completely linear.

Yes, you're right that getting an extra cheap-o board to handle the display is a good idea. It's well worth it. Just find a cheap used 8600GT or something that doesn't use much power or
take much room. We could deal with a laggy display if we had to (our kernels all run in less than 250 ms), but the chunky display, the worry about watchdog timeouts, and so on were annoying; just the CONVENIENCE of avoiding all that is worth the extra card.

As for getting full performance from 4 cards, if you’re not getting it now, you need to understand where your bottleneck is. It’s likely either PCIe speeds or CPU scheduling.
You say there's not much I/O, so it's likely your CPU scheduling. Do you have a quad-core CPU? Are you doing ANY work on the CPU that makes a GPU sit idle?
I initially had the CPU report a trivial little progress bar after each kernel completion. It turns out that was a bad idea: it slowed processing down by 20%, just because the CPU had to wait, then display, then schedule the next kernel. It's better to keep kernels queued up so they go one after the other with no delay.
Streams are your friend here, too… keep those queues loaded with work!
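Something like the sketch below is all it takes; this is illustrative only (the simulateStep kernel and the sizes are made up, not from my code). The point is that launches are asynchronous, so the driver keeps a queue of kernels and the GPU never waits on the host:

#include <cuda_runtime.h>

// Placeholder kernel standing in for one simulation step.
__global__ void simulateStep(float *state, int step)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    state[i] += 0.001f * step;
}

int main(void)
{
    const int n = 1 << 20;
    const int nSteps = 1000;

    float *d_state;
    cudaMalloc((void **)&d_state, n * sizeof(float));
    cudaMemset(d_state, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int step = 0; step < nSteps; ++step) {
        // Asynchronous launch: control returns to the CPU immediately, so the
        // next kernel is queued before this one finishes and the GPU never
        // sits idle waiting for the host.
        simulateStep<<<n / 256, 256, 0, stream>>>(d_state, step);
    }

    // Synchronize (or report progress) once per batch, not once per kernel.
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_state);
    return 0;
}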

I ended up making a progress display by using zero-copy memory, which worked even better than the kernel-launch method anyway… each GPU basically just reported its progress as it finished work blocks, and the CPU never had to interrupt anything; it just read the results (and could measure progress by the EXISTENCE of those results).
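For what it's worth, a minimal version of that zero-copy progress idea looks roughly like this (an illustrative sketch, not my production code; workKernel and the flag array are made up). Each block marks a slot in a host-mapped array when its chunk is done, and the CPU just polls the slots:

#include <cstdio>
#include <cuda_runtime.h>

// Each block marks its slot in a zero-copy (host-mapped) array when it has
// finished its chunk of work; the host just watches the slots fill up.
__global__ void workKernel(float *data, int n, unsigned int *doneFlags)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;                    // stand-in for real work

    __syncthreads();
    if (threadIdx.x == 0)
        doneFlags[blockIdx.x] = 1;          // "this work block exists now"
    // (On Fermi and later, a __threadfence_system() before this write
    //  guarantees it is visible to the host in order.)
}

int main(void)
{
    cudaSetDeviceFlags(cudaDeviceMapHost);  // must precede context creation

    const int n = 1 << 20, threads = 256, blocks = (n + threads - 1) / threads;

    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));

    // Zero-copy flags: host pointer plus a device alias of the same memory.
    unsigned int *h_flags, *d_flags;
    cudaHostAlloc((void **)&h_flags, blocks * sizeof(unsigned int),
                  cudaHostAllocMapped);
    for (int b = 0; b < blocks; ++b) h_flags[b] = 0;
    cudaHostGetDevicePointer((void **)&d_flags, h_flags, 0);

    workKernel<<<blocks, threads>>>(d_data, n, d_flags);

    // Poll the mapped flags without ever interrupting the GPU.
    volatile unsigned int *vflags = h_flags;
    int done = 0;
    while (done < blocks) {
        done = 0;
        for (int b = 0; b < blocks; ++b) done += (int)vflags[b];
        printf("\rprogress: %d / %d blocks", done, blocks);
        fflush(stdout);
    }
    printf("\n");

    cudaDeviceSynchronize();
    cudaFreeHost(h_flags);
    cudaFree(d_data);
    return 0;
}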

But speedups will be hugely application dependent. You imply that your GPUs need to talk to each other… that alone is perhaps the biggest complication.

CUDA doesn't have any intrinsic multi-GPU facility. If you want your code to use more than one GPU, you have to devise a scheme for it yourself. I use a pthreads-based arrangement for runtime API code (including CUBLAS) written in C, which launches a persistent thread for each physical GPU. Those threads hold their GPU context for the lifetime of the application and are fed work via a pthreads condition variable/mutex and function pointers, with barriers for synchronization between host threads. Memory transfers have to be explicitly managed by the host code; I use my own host-side memory manager working on a large pre-allocated block of GPU memory for that. It isn't that difficult for many classes of tasks, but true "interprocess" communication between GPUs is hard to do, and I have avoided it thus far. Using a pair of GPUs in two x16 slots, I get a 2x speedup over the single-card case in things like dense matrix algebra (in fact, some of my code actually runs faster than 2x, because offloading some of the host-GPU copying onto an independent thread allows additional latency hiding).
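To give a flavour of the structure, a heavily stripped-down sketch of the "one persistent thread per GPU" arrangement might look like the following. This is not my actual framework; the GpuThread structure, submit() and the example work item are all made up for illustration, and a real version needs error checking, barriers and a proper memory manager:

#include <pthread.h>
#include <stdio.h>
#include <cuda_runtime.h>

typedef void (*gpu_task_fn)(int device, void *arg);

struct GpuThread {
    int             device;
    pthread_t       tid;
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    gpu_task_fn     task;      /* NULL means "no work queued" */
    void           *taskArg;
    int             shutdown;
};

static void *gpuWorker(void *p)
{
    struct GpuThread *g = (struct GpuThread *)p;

    /* Bind this thread to its GPU once; the context lives as long as the thread. */
    cudaSetDevice(g->device);

    pthread_mutex_lock(&g->lock);
    for (;;) {
        while (g->task == NULL && !g->shutdown)
            pthread_cond_wait(&g->wake, &g->lock);   /* sleep until fed work */
        if (g->task == NULL)
            break;                                   /* woken only to shut down */

        gpu_task_fn fn  = g->task;
        void       *arg = g->taskArg;

        pthread_mutex_unlock(&g->lock);
        fn(g->device, arg);                          /* run the work item */
        pthread_mutex_lock(&g->lock);

        g->task = NULL;
        pthread_cond_broadcast(&g->wake);            /* tell submitters the slot is free */
    }
    pthread_mutex_unlock(&g->lock);
    return NULL;
}

static void submit(struct GpuThread *g, gpu_task_fn fn, void *arg)
{
    pthread_mutex_lock(&g->lock);
    while (g->task != NULL)                          /* wait for the slot */
        pthread_cond_wait(&g->wake, &g->lock);
    g->task    = fn;
    g->taskArg = arg;
    pthread_cond_broadcast(&g->wake);
    pthread_mutex_unlock(&g->lock);
}

/* Example work item: allocate, clear, and free a buffer on this thread's GPU. */
static void clearBuffer(int device, void *arg)
{
    size_t bytes = *(size_t *)arg;
    void *d_buf;
    cudaMalloc(&d_buf, bytes);
    cudaMemset(d_buf, 0, bytes);
    cudaFree(d_buf);
    printf("GPU %d finished its work item\n", device);
}

int main(void)
{
    int i, nDevices = 0;
    size_t bytes = 1 << 20;
    struct GpuThread g[16];

    cudaGetDeviceCount(&nDevices);
    if (nDevices > 16) nDevices = 16;

    for (i = 0; i < nDevices; i++) {
        g[i].device = i; g[i].task = NULL; g[i].shutdown = 0;
        pthread_mutex_init(&g[i].lock, NULL);
        pthread_cond_init(&g[i].wake, NULL);
        pthread_create(&g[i].tid, NULL, gpuWorker, &g[i]);
        submit(&g[i], clearBuffer, &bytes);          /* work runs on all GPUs concurrently */
    }

    for (i = 0; i < nDevices; i++) {                 /* drain pending work and shut down */
        pthread_mutex_lock(&g[i].lock);
        g[i].shutdown = 1;
        pthread_cond_broadcast(&g[i].wake);
        pthread_mutex_unlock(&g[i].lock);
        pthread_join(g[i].tid, NULL);
    }
    return 0;
}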

So my experience matches avidday's… use pthreads, and one PERSISTENT thread per GPU.

There are still lots of complications beyond that basic starting structure, but they'll likely be app-dependent, based on your data dependencies. Know your pthreads calls; you'll likely use condition variables to have threads wait and wake up.

Hi,

I wonder if you could expand on this topic in more detail. I built a large project with many CUDA-based functions that run on a host with a single CUDA card. Now I've just acquired a multi-GPU system, and I'm coming to terms with what needs to happen to rearrange my project so it runs on multiple CUDA cards in parallel. I settled on a model whereby, on startup, my application auto-detects the total number of CUDA cards that meet the minimum requirements and allocates resources in a global data structure that keeps track of GPU IDs and pointers to the allocated resources on each GPU. When one of my CUDA-based functions is called, it looks up the global data structure, decides if there is work to be done, and uses pthreads to divvy up the work among the GPUs. That was the intent, until I discovered the rules about “1 host thread per GPU” and “no change in GPU ID for a given thread”, and that when the threads end, so do the resources on the GPU card. I've seen a few posts that describe the use of pthreads condition variables and mutexes to get things working. My familiarity with POSIX threads is, however, somewhat limited. I guess that's about to change. I wonder if you could do one of two things for the folks following this topic.

  1. Can you point us to some literature that sheds light on a basic programming model?

  2. I attach a small C code that I put together to come to terms with the basics of this topic. It compiles on Linux and uses pthreads. On startup, it figures out the number of CUDA cards present, initializes a small array, and divides the array elements evenly (more or less) among the GPUs. It then starts a bunch of threads to move data to the GPUs, resets the array elements on the host to zero, then spawns another bunch of threads to get the data back from the GPUs. There is a CUDA variable at the top of the file. If set to “0”, all the work is done on the host; I added this to convince myself there were no underlying errors with the various pointers. If set to “1”, it works with the CUDA cards. As I indicated, it doesn't do what I intended. Until earlier today I knew nothing about pthreads condition variables, but it was my hope that you could take a look, point out where to fix things, and in doing so point me in a sensible direction.

Thanks, Richard

Ok. Let’s try this again. Hmmmm, browse for the file, ok, upload, ok,

sample.cu (5.84 KB)

If you can use boost in your project, then GPUWorker can hide all the details from you:

https://codeblue.umich.edu/hoomd-blue/trac/browser/trunk/libhoomd/utils/GPUWorker.cc
https://codeblue.umich.edu/hoomd-blue/trac/browser/trunk/libhoomd/utils/GPUWorker.h

(This is part of a larger project, but GPUWorker is a self-contained class.)
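Roughly, usage looks like the sketch below: each GPUWorker hides a persistent worker thread and its context, and you hand it work with call()/callAsync(). Treat this as illustrative only and check GPUWorker.h above for the authoritative interface and signatures:

#include <boost/bind.hpp>
#include <cuda_runtime.h>
#include "GPUWorker.h"

int main(void)
{
    // One GPUWorker per GPU; each owns a persistent worker thread and context.
    GPUWorker gpu0(0);
    GPUWorker gpu1(1);

    // Select the C-style overload explicitly so boost::bind can deduce the type.
    cudaError_t (*mallocFn)(void **, size_t) = cudaMalloc;

    float *d_data0 = 0, *d_data1 = 0;
    size_t bytes = 1000 * sizeof(float);

    // Each wrapped call is executed in the worker's own thread, so the main
    // thread never touches either context directly.
    gpu0.call(boost::bind(mallocFn, (void **)((void *)&d_data0), bytes));
    gpu1.call(boost::bind(mallocFn, (void **)((void *)&d_data1), bytes));

    gpu0.call(boost::bind(cudaMemset, (void *)d_data0, 0, bytes));
    gpu1.call(boost::bind(cudaMemset, (void *)d_data1, 0, bytes));

    gpu0.call(boost::bind(cudaFree, (void *)d_data0));
    gpu1.call(boost::bind(cudaFree, (void *)d_data1));
    return 0;
}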

Well, that's a partial solution. I got hold of the GPUWorker sources from HOOMD 0.8.2. If I add it to my little sample code, set the GPU IDs manually, and make a few mods, I can get it to work fine. For my bigger application, where I want it to automatically get the IDs of available GPUs with compute capability 1.3 or greater, that is, where I intend to dynamically allocate a GPUWorker vector, I run into troubles, which seem to be connected with changes introduced in CUDA 2.3. There's a thread to that effect elsewhere on the CUDA forum. Do you know if there is a version of GPUWorker around that is compatible with CUDA 3.0 (beta)?
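To make the intent concrete, what I'm after is roughly the following (the eligibleDevices helper and the details are only illustrative, not my actual code): enumerate the devices with the runtime API, keep only the IDs that report compute capability 1.3 or better, and then construct one GPUWorker per surviving ID.

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Collect the IDs of every device with compute capability >= 1.3.
std::vector<int> eligibleDevices(void)
{
    std::vector<int> ids;
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        if (prop.major > 1 || (prop.major == 1 && prop.minor >= 3))
            ids.push_back(dev);
    }
    return ids;
}

int main(void)
{
    std::vector<int> ids = eligibleDevices();
    for (size_t i = 0; i < ids.size(); ++i)
        printf("device %d is usable\n", ids[i]);

    // ...and then, ideally, one GPUWorker per entry, e.g.:
    //   std::vector<GPUWorker *> workers;
    //   for (size_t i = 0; i < ids.size(); ++i)
    //       workers.push_back(new GPUWorker(ids[i]));
    return 0;
}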

I have my own pthreads-based multi-GPU framework. The startup code uses the driver API to enumerate every CUDA GPU in the system, then launches a host thread for each. It uses a function passed to each thread to identify the capability of the GPUs:

void gpuIdentify(struct gpuThread * g)
{
	char compModeString[maxstring];
	char identstring[maxstring];

	gpuAssert( cuDeviceGet(&g->deviceHandle, g->deviceNumber) );
	gpuAssert( cuDeviceGetName(g->deviceName, maxstring, g->deviceHandle) );
	gpuAssert( cuDeviceGetProperties(&g->deviceProps, g->deviceHandle) );
	gpuAssert( cuDeviceTotalMem(&g->deviceMemoryTot, g->deviceHandle) );
	gpuAssert( cuDeviceGetAttribute(&g->deviceCompMode, CU_DEVICE_ATTRIBUTE_COMPUTE_MODE, g->deviceHandle) );
	gpuAssert( cuDeviceComputeCapability(&g->deviceCC[0], &g->deviceCC[1], g->deviceHandle) );

	switch (g->deviceCompMode) {
	case CU_COMPUTEMODE_PROHIBITED:
		sprintf(compModeString, "Compute Prohibited mode");
		break;
	case CU_COMPUTEMODE_DEFAULT:
		sprintf(compModeString, "Normal mode");
		break;
	case CU_COMPUTEMODE_EXCLUSIVE:
		sprintf(compModeString, "Compute Exclusive mode");
		break;
	default:
		sprintf(compModeString, "Unknown");
		break;
	}

	sprintf(identstring, "%d %s, %d MHz, %d Mb, Compute Capability %d.%d, %s",
			g->deviceNumber, g->deviceName, g->deviceProps.clockRate/1000,
			g->deviceMemoryTot / constMb, g->deviceCC[0], g->deviceCC[1], compModeString);

	gpuDiagMsg(stderr, identstring, __FILE__, __LINE__);
}

It will identify whether the GPUs are compute 1.3 and whether they are compute permitted/exclusive/prohibited. You will have to use your imagination as to what the structure looks like, but the subsequent per-thread code that parses it looks something like this:

void gpuInitialise(struct gpuThread *g)
{
	char initmsg[maxstring];

	/*
	 * Check whether the device is compute prohibited,
	 * and skip it if it is
	 */
	if (g->deviceCompMode == CU_COMPUTEMODE_PROHIBITED) {
		g->deviceAvail = 0;
		return;
	}

	/*
	 * Check the compute capability and skip it if it
	 * didn't report 1.3
	 */
	if ( !((g->deviceCC[0] == 1) && (g->deviceCC[1] == 3)) ) {
		g->deviceAvail = 0;
		return;
	}

	/* Attempt to establish a runtime API context */
	if ( cudaSetDevice(g->deviceNumber) != cudaSuccess) {
		g->deviceAvail = 0;
		return;
	}

..........
}

so that for threads assigned GPUs which can be used, a runtime API context is established. I use nvidia-smi to keep our cluster GPUs in compute exclusive mode, so that every launch of this code will only try to establish contexts on free devices, and the calling host code can be passed a GPU number to try to use. I don't know whether this helps or not, but it might give you some ideas about how to work out a method which suits your code.

Actually, I was able to put together a rather straightforward approach based on the GPUWorker class. For completeness, I attach a revision of my little sample code, should it prove useful to anyone stumbling on this thread. Combined with GPUWorker, it works fine, and the CUDA version (2.2, 2.3, 3.0) is not an issue. It looks like my problem is solved. Thank you, Mr. Anderson, for that little gem.

multiGPUsample.cu (5 KB)

Hello: You make it sound as if installing a second card to handle the display is a straightforward process. But … I'm running Ubuntu 9.10 and have just installed a GeForce 9800 alongside a GeForce 9600, with the explicit intention of using the 9600 to run my machine's graphics etc. and using the 9800 for numerical computation. I cannot for the life of me figure out how to make the 9600 handle the display only …

Originally, the motherboard (Asus M4A785TD) had only the 9600 in the first PCI-e slot; no problems (except a sluggish display under CUDA). I then installed the 9800 in the first PCI-e slot, moving the 9600 to the second slot. The NVIDIA utility ./deviceQuery shows that both cards are seen. But I get no video output at all from the 9600 when it is in the second slot, and hence cannot use it to run the display!!

All suggestions gratefully received!

DFR1947

You probably need to specify the card you want to use for the display by its PCI-e address. If the PCI-e bus enumerates like this:

00:00.0 Host bridge: ATI Technologies Inc RD790 Northbridge only dual slot PCI-e_GFX and HT3 K8 part
00:02.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge (external gfx0 port A)
00:07.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge (PCI express gpp port D)
00:09.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge (PCI express gpp port E)
00:0b.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge (external gfx1 port A)
00:11.0 SATA controller: ATI Technologies Inc SB700/SB800 SATA Controller [AHCI mode]
00:12.0 USB Controller: ATI Technologies Inc SB700/SB800 USB OHCI0 Controller
00:12.1 USB Controller: ATI Technologies Inc SB700 USB OHCI1 Controller
00:12.2 USB Controller: ATI Technologies Inc SB700/SB800 USB EHCI Controller
00:13.0 USB Controller: ATI Technologies Inc SB700/SB800 USB OHCI0 Controller
00:13.1 USB Controller: ATI Technologies Inc SB700 USB OHCI1 Controller
00:13.2 USB Controller: ATI Technologies Inc SB700/SB800 USB EHCI Controller
00:14.0 SMBus: ATI Technologies Inc SBx00 SMBus Controller (rev 3c)
00:14.1 IDE interface: ATI Technologies Inc SB700/SB800 IDE Controller
00:14.2 Audio device: ATI Technologies Inc SBx00 Azalia (Intel HDA)
00:14.3 ISA bridge: ATI Technologies Inc SB700/SB800 LPC host controller
00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge
00:14.5 USB Controller: ATI Technologies Inc SB700/SB800 USB OHCI2 Controller
00:18.0 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] HyperTransport Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Miscellaneous Control
00:18.4 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Link Control
01:00.0 VGA compatible controller: nVidia Corporation Device 05e6 (rev a1)
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)
04:00.0 VGA compatible controller: nVidia Corporation Device 05e6 (rev a1)
05:0e.0 FireWire (IEEE 1394): Texas Instruments TSB43AB23 IEEE-1394a-2000 Controller (PHY/Link)

(two VGA controllers in this case, at 01:00.0 and 04:00.0), you edit the xorg.conf file so that the device entry looks something like this:

Section "Device"
	Identifier	"Device0"
	Driver		"nvidia"
	VendorName	"NVIDIA Corporation"
	BoardName	"GeForce GTX 275"
	BusID		"PCI:1:0:0"
	Option		"Coolbits" "1"
EndSection

where BusID is the PCI-e address of the card you want to use for display.

Hello: Thanks for the considered reply. In retrospect, I realized that since my problem (no signal from the second NVIDIA card) shows up at BIOS time, it can have nothing to do with my xorg.conf setup (which is what I initially suspected). It looks more likely that my power supply (450 W) is insufficient for the two cards (9600GT + 9800GT), which I'm about to fix.

Anyway, thank you again for the suggestion: your info is likely to be very useful in any event.

Cheers, David