GPU to GPU transfers - most effective method?

Estimated time of arrival? :)

~ Q1 or Q2?

Fantastic, thanks Tim.

Elaborate please? You aren’t the first person to say it’s a bad idea, of course. BAR1 goes unused in all the CUDA and OpenGL tests I have run (I can tell by tracing the MMIO), so there isn’t contention over its use. I can understand why you would steer people away from it, because one can do a lot more than just fiddle with pagetables. But when there is no way to do what you want through the closed-source APIs, and the hardware is clearly capable of it, hacking the GPU directly is the best way.

This technique was easy to implement (after tracking down the appropriate registers), and it has not impacted system stability. Latency and CPU utilization are way down, and overall system bandwidth is up because the peer-to-peer traffic stays within the PCIe switch (so the root complex doesn’t have to carry the data twice).
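
For anyone curious about the host side of this: once the GPU’s pagetables point part of the BAR1 aperture at VRAM, getting a CPU pointer to it doesn’t need any vendor API on Linux, since the kernel exposes each BAR through sysfs. Here is a minimal sketch; the device address 0000:03:00.0 and the 1 MB window size are placeholders, not my actual setup:

/*
 * Sketch only: mmap part of the GPU's BAR1 aperture from user space.
 * Assumes the pagetables have already been set up to back this window
 * with VRAM; the BDF below is a placeholder.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/bus/pci/devices/0000:03:00.0/resource1";
	int fd = open(path, O_RDWR | O_SYNC);
	if (fd < 0) { perror("open"); return 1; }

	size_t len = 1 << 20;	/* first 1 MB of the BAR1 aperture */
	volatile uint32_t *bar1 = mmap(NULL, len, PROT_READ | PROT_WRITE,
	                               MAP_SHARED, fd, 0);
	if (bar1 == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

	/* CPU-side peek at whatever the GPU's pagetables expose here. */
	printf("first dword: 0x%08x\n", (unsigned)bar1[0]);

	munmap((void *)bar1, len);
	close(fd);
	return 0;
}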

Hopefully the next release of CUDA will have an open memory system so its users won’t have to go to these great lengths just to map VRAM into PCI space. That functionality alone lets you greatly streamline the processing pipeline, getting rid of a lot of the I/O overhead once thought to be inherent in discrete GPUs and unshackling the GPU from the CPU for a much more balanced heterogeneous system.

pagetable structs (for those interested):

/**
 * Page table entry
 */
typedef struct nvPte_t
{
	/**
	 * Bit[0] Indicates if this PTE is present
	 */
	u64 Present : 1;

	/**
	 * @internal Bits[1-2]
	 */
	u64 Reserved1 : 2;

	/**
	 * Bit[3] Is this page read-only?  Otherwise it is read-write.
	 */
	u64 ReadOnly : 1;

	/**
	 * @internal Bit[4]
	 */
	u64 Reserved4 : 1;

	/**
	 * Bit[5] Target memory - 0 for VRAM, 1 for SYSRAM
	 */
	u64 Target : 1;

	/**
	 * @internal Bit[6]
	 */
	u64 Reserved6 : 1;

	/**
	 * Bits[7-9] Log2 of contiguous block size
	 * Contiguous blocks are aligned groups of 2, 4, 8, 16, 32, 64, or 128
	 * contiguous entries mapped to contiguous physical addresses.
	 */
	u64 ContigBlock : 3;

	/**
	 * @internal Bits[10-11]
	 */
	u64 Reserved10 : 2;

	/**
	 * Bits[12-39] Physical address of page, shifted right 12 bits
	 */
	u64 PhyAddress : 28;

	/**
	 * Bits[40-46] Tiling format
	 * 0 means linear memory, no tiling
	 */
	u64 Tiling : 7;

	/**
	 * Bits[47-48] Compression type
	 * 0 indicates no compression
	 */
	u64 Compression : 2;

	/**
	 * @internal Bits[49-63]
	 */
	u64 Reserved49 : 15;

} __attribute__((packed)) nvPte;

/**
 * Page directory entry
 */
typedef struct nvPde_t
{
	/*
	 * Bit[0] Set if this PDE is present, otherwise the PDE is invalid
	 */
	u64 Present : 1;

	/*
	 * Bit[1] Indicates the size of all pages for all PTEs in this PDE
	 *        If set, the page size is 4K.  If not set, the page size is 64K.
	 *        64K is referred to as large page mode, while 4K is small page mode.
	 */
	u64 PageSize : 1;

	/*
	 * Bits[2-3] The type of memory which this PDE uses, VRAM or SYSRAM.
	 *           If 0, the PDE uses VRAM.  If 1, the PDE uses SYSRAM.
	 */
	u64 Target : 2;

	/*
	 * @internal Bit[4]
	 */
	u64 Reserved4 : 1;

	/*
	 * Bits[5-6] If in small page mode, this field indicates the number of PTEs in the PDE:
	 *
	 *              0x0 -> 0x20000 entries, covering 512MB of virtual address space
	 *              0x1 -> 0x8000  entries, covering 128MB of virtual address space
	 *              0x2 -> 0x4000  entries, covering  64MB of virtual address space
	 *              0x3 -> 0x2000  entries, covering  32MB of virtual address space
	 *
	 *           If in large page mode, there are always 0x2000 entries covering 512MB
	 *           of virtual address space.
	 */
	u64 Entries : 2;

	/*
	 * @internal Bits[7-11]
	 */
	u64 Reserved7 : 5;

	/*
	 * Bits[12-39] Address of this PDE's table of PTEs, shifted right 12 bits (page-aligned)
	 */
	u64 Address : 28;

	/*
	 * @internal Bits[40-63]
	 */
	u64 Reserved40 : 24;

} __attribute__((packed)) nvPde;
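
To make the layout concrete, here is a minimal sketch of filling in one small-page PTE for a linear, uncompressed VRAM page using the struct above. The helper name and values are my own illustration and assume the bit layout is exact; actually installing the entry in the GPU’s pagetable (through the registers mentioned earlier) is hardware-specific and not shown.

/*
 * Illustrative sketch only.  Assumes `typedef uint64_t u64;` and the
 * nvPte definition above, and that the bit layout there is exact.
 */
static nvPte make_linear_vram_pte(u64 vram_phys, int read_only)
{
	nvPte pte = {0};

	pte.Present     = 1;
	pte.ReadOnly    = read_only ? 1 : 0;
	pte.Target      = 0;                /* 0 = VRAM                       */
	pte.ContigBlock = 0;                /* single page, no contiguous run */
	pte.PhyAddress  = vram_phys >> 12;  /* page-aligned, shifted right 12 */
	pte.Tiling      = 0;                /* linear memory, no tiling       */
	pte.Compression = 0;                /* no compression                 */

	return pte;
}

The resulting 64-bit value is what would get written into the PDE’s table of PTEs, presumably followed by a TLB/cache flush on the GPU side; that part is what the register hunting mentioned above was about.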

BAR1 space is a system resource that the kernel-mode driver may use at any time, depending on what else is running on the machine. Beyond that, there are a lot of reasons you don’t want to use BAR1. Even if NVIDIA were to do this, it wouldn’t work in the general case, because of the PCIe spec and what chipsets are actually required to support. There are additional PCIe ordering issues that can make the GPU very unhappy. Even if you get around all of that, performance isn’t that good. Finally, BAR1 hacking may break at any time, depending on what we decide to do with BAR1 in the driver.

It’s cool that you’ve gotten this to work at all, but speaking as the guy who knows more about BAR1 and multi-GPU than anyone else around at this point, this is really not a path you want to go down.

The system primarily uses BAR3; BAR1 goes unused. Of course, you can decide to do whatever you want with it in the future. The beauty is that, with user-space drivers, so can I.

The standard-compliant PCIe switches I’ve used (from IDT or PLX, for example) work just fine. You don’t need an NF200 from NVIDIA to do this. But you can see how exposing this functionality in CUDA has the potential to detract from SLI branding and NF200 sales.

It is definitely faster than buffering it through system RAM and synchronizing both GPUs with the CPU.

It may be true that accessing VRAM through BAR1 is slower than using the GPU’s DMA controller to do it, but sometimes another PCI device needs to be the one doing the DMA: a 10GigE NIC, InfiniBand, another GPU, and so on. Generally, those I/O devices are going to fire up their DMA engine(s) once they receive traffic.
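
To make that concrete: the peer device only needs the bus address of the BAR1 window to target it. On Linux that can be read from the sysfs resource file, and on typical x86 systems the physical address reported there is also the address a peer would DMA to. A quick sketch (the BDF is a placeholder):

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/bus/pci/devices/0000:03:00.0/resource", "r");
	if (!f) { perror("resource"); return 1; }

	/* Each line of the sysfs resource file is "start end flags";
	 * the second line corresponds to BAR1. */
	uint64_t start = 0, end = 0, flags = 0;
	for (int bar = 0; bar <= 1; bar++) {
		if (fscanf(f, "%" SCNx64 " %" SCNx64 " %" SCNx64,
		           &start, &end, &flags) != 3) {
			fclose(f);
			return 1;
		}
	}
	fclose(f);

	printf("BAR1: base 0x%" PRIx64 ", size 0x%" PRIx64 "\n",
	       start, end - start + 1);
	return 0;
}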

I appreciate your comments and expertise in this area, thank you.

Our kernel-mode driver is certainly not going to assume that someone else is mapping allocations into BAR1. (And that is ignoring that BAR1 is small, too.)

I wasn’t referring to NF200 (or similar) when I mentioned the PCIe spec. Plus, even the appearance that it’s working doesn’t mean it will work reliably (it won’t).

I understand what you want to do, but mapping BAR1 is simply not a good idea. I’m speaking from experience; if it were this easy to support cross-device transfers, it would have been exposed years ago.

Don’t worry about BAR1 being shared and the kernel-mode driver using it. Those are not the issue here. Nor is BAR1’s size (256MB is enough).

Is there something about the GPU’s BARs themselves that makes this unreliable or slow? Because that’s not my experience with it.

SLI has been around for years. And indeed, the staff had talked about this feature ‘coming soon’ in CUDA years ago.

I am very interested in the page table layout and the bits in the PTE. How did you figure out this layout? How confident are you that this is how the page table is actually implemented?

Javi