From NIC to GPU.

Yes, that’s a possibility. How big can the memory area be for cudaMallocHost — could I get 8 GB? Will it be contiguous in physical memory? Probably not, so my kernel driver would have to maintain some form of translation list. As mentioned earlier, I would simply prefer to tell both CUDA and my driver: use the 8 GB I have reserved for you at boot time.

Cheers,

peter

Heheh, 8 GB? Most likely not that much (I have been allocating 512 MB reliably). And it certainly won’t be contiguous.

You mentioned earlier that you are transferring data in 16 MB buffers; would it suffice to have multiple 16 MB buffers that are each contiguous? I have been allocating contiguous 8 MB buffers without much problem. Here’s my procedure for doing so:

void* contigAlloc = NULL;
Array<void*> badAllocs;

while( true )
{
    // cuMemHostAlloc() returns the pinned allocation through its first argument
    void* virtAddr = NULL;
    if( cuMemHostAlloc( &virtAddr, numPages * 4096, 0 ) != CUDA_SUCCESS )
        break;  // out of pinned memory; give up

    void* lastPhys = NULL;
    bool contiguous = true;

    for( uint n = 0; n < numPages; n++ )
    {
        void* currPhys = GetPhysicalAddress( (char*)virtAddr + n * 4096 );
        if( n > 0 && (char*)lastPhys + 4096 != (char*)currPhys )
        {
            contiguous = false;
            break;
        }
        lastPhys = currPhys;
    }

    if( contiguous )
    {
        contigAlloc = virtAddr;
        break;
    }
    else
        badAllocs.Add( virtAddr );
}

// release the rejected allocations only after a good one is found
for( uint n = 0; n < badAllocs.GetLength(); n++ )
    cuMemFreeHost( badAllocs[n] );

return contigAlloc;

The non-contiguous allocations aren’t freed immediately, so that CUDA won’t hand the same pages right back to you.
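
The GetPhysicalAddress() helper used above isn’t shown in the snippet. For reference, here is a minimal sketch of one way to implement it on Linux from user space by reading /proc/self/pagemap (assuming 4 KB pages; the page must already be resident, which pinned CUDA allocations are, and recent kernels hide the PFN from unprivileged processes):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

void* GetPhysicalAddress( void* virtAddr )
{
    long pageSize = sysconf(_SC_PAGESIZE);
    int  fd = open("/proc/self/pagemap", O_RDONLY);
    if( fd < 0 )
        return NULL;

    /* one 64-bit entry per virtual page: bit 63 = page present, bits 0..54 = PFN */
    uint64_t entry  = 0;
    off_t    offset = ((uintptr_t)virtAddr / pageSize) * sizeof(uint64_t);
    ssize_t  got    = pread(fd, &entry, sizeof(entry), offset);
    close(fd);

    if( got != (ssize_t)sizeof(entry) || !(entry & (1ULL << 63)) )
        return NULL;                     /* entry unreadable or page not present */

    uint64_t pfn = entry & ((1ULL << 55) - 1);
    return (void*)(uintptr_t)(pfn * pageSize + ((uintptr_t)virtAddr & (pageSize - 1)));
}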

Yes, that’ll do. Thanks for the code sample. Once I get these 16 MB blocks in user space, I would have to send the list somehow to the device driver, e.g. a loop of ioctl calls, and the driver would then maintain a list of physical addresses to DMA into. Sounds like a plan. And should the CUDA process crash, I would need to tell the device driver very quickly, e.g. by keeping a flag that the user-space app has to write to and the driver reads from.

Cheers,

peter
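
A minimal sketch of what the user-space side of that ioctl handshake could look like; the /dev/capture0 node, the ioctl number and the descriptor struct are placeholders for whatever the real driver would define:

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* hypothetical descriptor for one pinned 16 MB block */
struct dma_buf_desc {
    uint64_t phys_addr;   /* physical address of the block */
    uint64_t length;      /* block length in bytes         */
};

#define CAPTURE_IOC_MAGIC   'c'
#define CAPTURE_IOC_ADD_BUF _IOW(CAPTURE_IOC_MAGIC, 1, struct dma_buf_desc)

/* hand each pinned block to the (hypothetical) capture driver */
static int register_dma_buffers(const struct dma_buf_desc *bufs, int count)
{
    int fd = open("/dev/capture0", O_RDWR);
    if (fd < 0)
        return -1;

    for (int i = 0; i < count; i++) {
        if (ioctl(fd, CAPTURE_IOC_ADD_BUF, &bufs[i]) < 0) {
            close(fd);
            return -1;
        }
    }
    close(fd);
    return 0;
}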

I’ve noticed that as time goes on and physical memory gets more and more fragmented, this method fails more often. Often, if you request a small amount of memory from cuMemHostAlloc(), like a few MB, it will only be contiguous for a few pages. However, if you request larger amounts, like 64 or 128 MB, the allocation can contain hundreds of contiguous pages. I know that NV’s internal resource manager caches allocations that it makes at system boot, so these smaller requests are always fragmented. So now I’m allocating large blocks and extracting the physically contiguous regions from those; I’ll let you know if it’s reliable.
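
A sketch of that extraction step, reusing the GetPhysicalAddress() helper from earlier: walk the pages of one large pinned allocation and remember the longest physically contiguous run (the run-length bookkeeping here is illustrative, not taken from the post):

/* find the longest physically contiguous run of 4 KB pages inside a big pinned block */
void FindContiguousRun( void* base, unsigned numPages, void** runStart, unsigned* runPages )
{
    *runStart = base;
    *runPages = 1;

    void*    curStart = base;
    unsigned curLen   = 1;
    void*    lastPhys = GetPhysicalAddress( base );

    for( unsigned n = 1; n < numPages; n++ )
    {
        void* virt = (char*)base + n * 4096;
        void* phys = GetPhysicalAddress( virt );

        if( (char*)lastPhys + 4096 == (char*)phys )
            curLen++;                 /* still contiguous, extend the current run */
        else
        {
            curStart = virt;          /* physical gap: start a new run here       */
            curLen   = 1;
        }

        if( curLen > *runPages )
        {
            *runStart = curStart;
            *runPages = curLen;
        }
        lastPhys = phys;
    }
}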

I wonder whether it would be feasible to hack this resource manager such that it uses memory reserved at boot time, e.g. “mem=4G memmap=8G$4G”, through a driver option. Only this way can you guarantee that the memory is available. Say, if I reserve 8 GB, then it should be possible to get 512 blocks of 16 MB. Depending on the resource allocator, but not unthinkable, subsequent cudaMallocHost() calls could return subsequent physical memory blocks. Downside: it would still require an NVIDIA driver hack, which I am less inclined to do.

peter
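
For what it’s worth, claiming such a boot-time reservation from one’s own kernel module (rather than from the NVIDIA resource manager) is straightforward; a rough sketch matching the “memmap=8G$4G” example above, i.e. 8 GB reserved starting at the 4 GB mark, with the base/size values and the region name being assumptions:

#include <linux/init.h>
#include <linux/module.h>
#include <linux/io.h>
#include <linux/ioport.h>

#define RESERVED_BASE 0x100000000ULL   /* 4 GB: start of the memmap'd hole     */
#define RESERVED_SIZE 0x200000000ULL   /* 8 GB: size reserved on the boot line */

static void __iomem *reserved_virt;

static int __init reserved_init(void)
{
    /* mark the boot-time reservation as owned by this driver */
    if (!request_mem_region(RESERVED_BASE, RESERVED_SIZE, "capture_reserved"))
        return -EBUSY;

    /* a CPU mapping is only needed if the driver itself touches the data;
       for filling DMA descriptors the physical addresses alone suffice */
    reserved_virt = ioremap(RESERVED_BASE, RESERVED_SIZE);
    if (!reserved_virt) {
        release_mem_region(RESERVED_BASE, RESERVED_SIZE);
        return -ENOMEM;
    }
    return 0;
}

static void __exit reserved_exit(void)
{
    iounmap(reserved_virt);
    release_mem_region(RESERVED_BASE, RESERVED_SIZE);
}

module_init(reserved_init);
module_exit(reserved_exit);
MODULE_LICENSE("GPL");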

There are only a few places in the low-level driver that allocate memory using get_free_pages(), so that is a possibility I think (check nv-vm.c). However, I would be concerned about side-effects occurring (for one, only memory < 4GB will work for Nvidia cards). If I can’t cull the contiguous regions I need from user-space, I’ll probably try that next.

Oooh, that’s not going to work then, even on a 64-bit machine? lspci -v reveals:

07:00.0 3D controller: nVidia Corporation GT200 [Tesla C1060 / Tesla S1070] (rev a1)

	Memory at fa000000 (32-bit, non-prefetchable) 

	Memory at dc000000 (64-bit, prefetchable) 

	Memory at f8000000 (64-bit, non-prefetchable) 

So I would have thought that one of these memory windows is used for DMA. If a 32-bit window on the GPU is used for DMA from host to GPU, does this mean that the host memory window also has to be in the 32-bit region, i.e. below 4 GB?

Well, then I’ll have to go for plan B, i.e. keep several blocks in the 32-bit region as a ring buffer for the processing pipeline, plus some other mechanism to stream or memcpy data from the pipeline to the 64-bit reserved memory for long-term capture.

Cheers,

peter
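
A rough sketch of the plan-B bookkeeping: a small ring of pinned blocks below 4 GB that the DMA/processing pipeline fills, and a drain step that copies each finished block out to the 64-bit reserved capture area (the ring depth, the memcpy drain, and the absence of locking are all simplifying assumptions):

#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE  (16u << 20)        /* 16 MB blocks, as discussed above    */
#define RING_BLOCKS 8                  /* ring depth: an arbitrary assumption */

struct block_ring {
    void*    block[RING_BLOCKS];       /* pinned buffers below 4 GB (DMA targets) */
    unsigned head;                     /* next block the device will fill         */
    unsigned tail;                     /* next block to drain                     */
};

/* called when the device signals that ring->block[ring->head] has been filled */
static void ring_produce(struct block_ring* r)
{
    r->head = (r->head + 1) % RING_BLOCKS;
}

/* copy every finished block into the large 64-bit reserved capture area */
static void ring_drain(struct block_ring* r, char* capture_area, size_t* written)
{
    while (r->tail != r->head) {
        memcpy(capture_area + *written, r->block[r->tail], BLOCK_SIZE);
        *written += BLOCK_SIZE;
        r->tail = (r->tail + 1) % RING_BLOCKS;
    }
}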

I’m not too sure, I’ve seen different tidbits in the driver that have me confused. For now, I’m keeping things 32-bit for simplicity. Perhaps someone with better knowledge of the internals could help.

// nv-vm.c line 457
// for amd 64-bit platforms, remap pages to make them 32-bit addressable
// in this case, we need the final remapping to be contiguous, so we
// have to do the whole mapping at once, instead of page by page
if ((ret = nv_sg_map_buffer(dev, at->page_table, (void *)virt_addr, at->num_pages)))
{
    if (ret < 0)
    {
        nv_printf(NV_DBG_ERRORS,
                  "NVRM: VM: nv_vm_malloc_pages: failed to remap contiguous "
                  "memory\n");
    }
    NV_FREE_PAGES(virt_addr, at->order);
    return -1;
}
  • DMA memory is allocated with get_free_pages() using the GFP_DMA32 flag
// nv.c, line 4645
nv_printf(NV_DBG_ERRORS,
          "NVRM: This is a 64-bit BAR mapped above 4GB by the system\n"
          "NVRM: BIOS or the Linux kernel.  The NVIDIA Linux/x86\n"
          "NVRM: graphics driver and other system software components\n"
          "NVRM: do not support this configuration.\n");

I just came across the paper “Implementation of an SDR System Using Graphics Processing Unit” in the March issue of IEEE Communications Magazine, and the authors did exactly what this thread wanted to achieve (using the architecture in Figure 1). The paper itself doesn’t elaborate much on implementation details, though.

I’m curious: I’m considering moving some processes over to GPUs, but we have an odd setup. First off, we don’t run any recognizable OS (we’re not even close to POSIX, but the Linux community is the closest to a home town we’ve got). Second, our systems use PCIe as a backbone, so there are a large number of devices on the PCIe network, and there is no real root device with system memory.

My question is: if I wanted to write my own very minimal driver for our system, would it be possible? I was under the impression that the Linux driver for NVIDIA GPUs was closed, but perhaps it’s only partially closed? I’m also curious if the PCIe protocol is open, and if not, whether anyone thinks it’s simple enough to reverse engineer using a PCIe interposer. I haven’t studied the problem well enough to know all that’s involved, but I wouldn’t need all the features of a full driver, just enough to start the GPU up, load some code, and then send data and get results when it’s done executing. What we’re doing is mostly compute intensive and not very data intensive in comparison, but latency is a concern, so I’d much prefer not to route things out of our PCIe backbone.

I have an application where the argument for PCIe peer-to-peer transfers vs. pinned memory transfers is obvious.

We use multiple data acquisition boards in high PCIe slot-count systems. We’re currently researching the use of one CUDA board for each acquisition board to process data before sending it to the host for display or storage. The more acquisition+CUDA board pairs we can utilize in each system the better.

On our current system, I believe the data flow for the pinned memory transfer method is as follows for a simple case where the data is stored to host RAM (All devices and switches are PCIe Gen 2):
AcqBrd->PCIe x8 Switch->MemCtrl->Host RAM->MemCtrl->PCIe x8 Switch->CUDA Board->PCIe x8 Switch->MemCtrl->Host RAM

Here it is using peer-to-peer transfers:
AcqBrd->PCIe x8 Switch->CUDA Board->PCIe x8 Switch->MemCtrl->Host RAM

The major benefit for me here is that the host side traffic of the PCIe x8 switch is utilized far less. At a rate of 800 MB/s off of the acquisition board and the same data rate coming off of the CUDA board, the number of board pairs I can use is cut in half at best with the pinned method due to the excessive use of that PCIe x8 switch. If we have four pairs of boards, we’re looking at 3,200 MB/s of CUDA board upstream traffic on the host side of the PCIe switch, which is not possible if that switch is also sending acquisition board data upstream as well (PCIe x8 Gen2 theoretical throughput is 4 GB/s).

The argument Tim made earlier is probably valid for certain cases, where some latency is the only real drawback, but where other system resources are critical the peer-to-peer method can be a huge gain. Using the pinned memory method the host memory controller, host RAM, and PCIe resources are all taxed more to some degree, which also means that other host work potential is diminished as well (such as further data processing on the host and then storage to disk via PCIe RAID controller(s)).

That said, is there any officially supported way to do peer-to-peer PCIe transfers to a CUDA board currently, or does NVIDIA at least have it on the roadmap? Please don’t hate me, but my application is for Windows, so mods to the kernel or driver are not feasible; that’s why I’m interested in officially supported solutions.

Thanks,
Brian
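
A quick back-of-the-envelope check of the four-pair budget described above, taking the stated 800 MB/s per board and roughly 4,000 MB/s usable in one direction on the x8 Gen2 host link (PCIe is full duplex, so only upstream traffic is summed):

#include <stdio.h>

int main(void)
{
    const int pairs     = 4;      /* acquisition + CUDA board pairs            */
    const int rate_mb_s = 800;    /* per acquisition board and per CUDA board  */
    const int link_mb_s = 4000;   /* ~usable PCIe x8 Gen2 bandwidth, one way   */

    /* pinned path: acquisition data and CUDA results both cross the host link upstream */
    int pinned_up = pairs * rate_mb_s + pairs * rate_mb_s;

    /* peer-to-peer path: acq->CUDA stays below the switch, only results go upstream */
    int p2p_up = pairs * rate_mb_s;

    printf("pinned upstream: %d MB/s of %d MB/s available\n", pinned_up, link_mb_s);
    printf("p2p    upstream: %d MB/s of %d MB/s available\n", p2p_up, link_mb_s);
    return 0;
}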
