Real-time GPU processing: peer-to-peer data copy, Linux kernel memory, kernels in the kernel

The Xilinx driver does not provide an ioctl, but I can add one. There are two cases, though:

  1. The ioctl passes a pointer obtained from cudaMallocHost() into the kernel; the ioctl file op then sets off a DMA transfer directly into this area. The only question I have here: how does the kernel obtain the kernel virtual memory address or the physical address from the pointer returned by cudaMallocHost()? Imagine the program that requested the cudaMallocHost() memory dies while the driver is still DMA-ing data into it (see the sketch after this post).
  2. Use a read() system call, DMA into a slab and then copy_to_user(). So one additional memcpy.
    The idea is: let’s have a piece of reserved, contiguous kernel memory and have it used by both CUDA and my FPGA driver. Any cudaMemcpy would then use DMA to the GPU, and I can DMA data from the FPGA into it, too.
    peter
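
As a rough illustration of case #1 (not Peter’s actual driver): one common way to keep the cudaMallocHost() buffer alive during the DMA is to pin its pages in the ioctl handler and translate them to physical addresses. The ioctl structure, struct fpga_dma_req and the descriptor programming below are made-up placeholders, and the get_user_pages_fast() signature varies between kernel versions.

```c
/*
 * Minimal sketch for case #1: pin the user buffer passed in via ioctl and
 * translate it to physical addresses before programming the FPGA DMA engine.
 * Error handling and the actual descriptor writes are omitted.
 */
#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
#include <asm/io.h>

struct fpga_dma_req {
    void __user *buf;   /* pointer returned by cudaMallocHost() */
    size_t       len;
};

static long fpga_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
    struct fpga_dma_req req;
    struct page **pages;
    int i, npages, pinned;

    if (copy_from_user(&req, (void __user *)arg, sizeof(req)))
        return -EFAULT;

    npages = DIV_ROUND_UP(req.len, PAGE_SIZE);
    pages = kcalloc(npages, sizeof(*pages), GFP_KERNEL);
    if (!pages)
        return -ENOMEM;

    /* Pin the user pages so they stay resident while the FPGA writes to them.
     * (Signature of get_user_pages_fast() differs between kernel versions.) */
    pinned = get_user_pages_fast((unsigned long)req.buf, npages, 1, pages);
    if (pinned < npages)
        goto out_release;

    for (i = 0; i < npages; i++) {
        phys_addr_t phys = page_to_phys(pages[i]);
        /* program one scatter/gather descriptor of the FPGA DMA engine with
         * 'phys' here */
    }
    /* start the transfer; put_page() each page once the DMA has completed */
    kfree(pages);
    return 0;

out_release:
    for (i = 0; i < pinned; i++)
        put_page(pages[i]);
    kfree(pages);
    return -EFAULT;
}
```

Pinning the pages also means the buffer cannot be released out from under the FPGA if the process dies before put_page() is called, which addresses the crash scenario above.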

Peter, so far I have gone down the path of use case #1. It seems like what we are trying to achieve is quite similar, e.g. pipelining FPGA → GPU with the lowest latency possible. In fact, I’ve even heard of some people in this region that are starting to use Nvidia GPUs for radar processing ;)

I am also using a Xilinx V5. I ioctl the buffers allocated through cuMemHostAlloc() to our driver, look up the vma through find_vma(), and calculate the physical address from that. The problem, like you pointed out, is that if the app crashes and the NV driver removes the mapping while the FPGA is still writing to it, boom, the machine’s dead. I’m going to try adding references to the vm_area_struct so hopefully the kernel will keep the mapping around even after NV discards it.

I was hoping that there would be some improvements in Cuda 3.1 for doing this kind of thing, the release notes even made a big point about it, alas there is nothing new in the toolkit.

You mentioned earlier that you could modify the low-level NV driver to accomplish this; that would be approaching this problem from the other end. I have implemented some changes in their allocator, but got this way up and running first. It was easier to pass memory down into a driver you have the source to than to try to pass it up from the depths of the Nvidia driver and out through Cuda.

Well, that’s our plan, ~2GB/s in chunks of 16MB every 8.4ms.

It is certainly one possibility, but it strikes me as error prone if a user space program allocates something a kernel driver is then relying on. It should really be the other way round.

From a business perspective, it is not desirable that an application relies on modifications to NVidia’s driver. Not only would any warranty be waived, but the same patch would also have to be applied and tested for every CUDA release.

The perfect compromise would be if Nvidia offered something like

cudaMallocHost(void **ptr_out, void *ptr_in, size_t size)

ptr_in could be a physical address of reserved memory used by the driver. Alternatively, the device driver allocates some contiguous kernel memory, a user space application mmaps it into user virtual memory and passes it on to cudaMallocHost, which would simply remap it into the driver’s kernel virtual memory space.

The result is that several parties access the same memory window, the device driver, the nvidia driver and a user space program. The user will have to make sure that they do not step on each other’s feet.
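
As a sketch of the second variant (driver-owned memory exported via mmap; all names and the 16 MB size are illustrative, and CUDA currently offers no call to adopt such a mapping):

```c
/*
 * Sketch: the FPGA driver owns a contiguous DMA buffer and exports it to
 * user space via mmap(). fpga_mmap and dma_buf_size are illustrative names.
 */
#include <linux/dma-mapping.h>
#include <linux/fs.h>
#include <linux/mm.h>

static void       *dma_vaddr;   /* kernel virtual address of the buffer  */
static dma_addr_t  dma_handle;  /* bus address programmed into the FPGA  */
static const size_t dma_buf_size = 16 << 20;   /* one 16 MB block        */

/* Called once at probe time with the FPGA's struct device *dev:
 *     dma_vaddr = dma_alloc_coherent(dev, dma_buf_size, &dma_handle, GFP_KERNEL);
 */

static int fpga_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;

    if (size > dma_buf_size)
        return -EINVAL;

    /* Map the driver-owned buffer into the calling process. */
    return remap_pfn_range(vma, vma->vm_start, dma_handle >> PAGE_SHIFT,
                           size, vma->vm_page_prot);
}
```

User space would then mmap() the device file and, ideally, hand the resulting pointer to a cudaMallocHost-like registration call, if Nvidia ever exposes one.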

Cheers,

peter

Yes, a cudaMallocHost() that takes in physical pointers would be the ideal solution, as you have said. There was talk of this feature “coming soon” here on the forums more than a year ago, after which any reference to it promptly disappeared. It looks like they planned to expose an API like this in the 3.1 release (GPUDirect), but there must have been some kind of mix-up because there isn’t anything new in the toolkit or the drivers.

I found that my system was unstable because I was doing the virtual → physical translation incorrectly. Previously, I had used find_vma() and calculated the physical address as an offset from that. I found the correct way to do it in nv.c in the driver, in the function nv_kern_get_physical() - you have to walk the page-tables. Now everything runs smoothly, even if the process crashes (as long as I stop the FPGA shortly thereafter - the kernel will close the dev file for you). I also find patching the Nvidia driver to be undesirable in many aspects and am glad that this approach works. For my experiments though, it has been very useful (mostly enabling debugging and adding printk’s).
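
For reference, the page-table walk described above looks roughly like this on a 2.6-era kernel (the helper name user_virt_to_phys is mine; huge pages, locking and error handling are ignored):

```c
/* Translate a user virtual address to a physical address by walking the
 * four-level page tables of the given mm. Returns 0 if not mapped. */
#include <linux/mm.h>
#include <asm/pgtable.h>

static unsigned long user_virt_to_phys(struct mm_struct *mm, unsigned long vaddr)
{
    pgd_t *pgd;
    pud_t *pud;
    pmd_t *pmd;
    pte_t *pte;
    unsigned long phys;

    pgd = pgd_offset(mm, vaddr);
    if (pgd_none(*pgd) || pgd_bad(*pgd))
        return 0;
    pud = pud_offset(pgd, vaddr);
    if (pud_none(*pud) || pud_bad(*pud))
        return 0;
    pmd = pmd_offset(pud, vaddr);
    if (pmd_none(*pmd) || pmd_bad(*pmd))
        return 0;
    pte = pte_offset_map(pmd, vaddr);
    if (!pte_present(*pte)) {
        pte_unmap(pte);
        return 0;
    }
    /* Physical page frame plus the offset within the page. */
    phys = (pte_pfn(*pte) << PAGE_SHIFT) | (vaddr & ~PAGE_MASK);
    pte_unmap(pte);
    return phys;
}
```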

8.4ms is ambitious, good luck! In my experience, the latency requirements for audio (1-5ms) < ladar (10-20ms) < video (30ms), all of which are now possible with ZeroCopy. Nvidia will maintain that direct GPU communication is wholly unnecessary, but I know they plan on this at some point and are just taking it slow at the moment.
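
For anyone unfamiliar with the ZeroCopy path mentioned above: since CUDA 2.2 a pinned host buffer can be mapped into the device address space so kernels read it directly over PCIe. A minimal example (the 16 MB size is just a placeholder):

```c
/* Zero-copy: map pinned host memory into the device address space so a
 * kernel can read it directly over PCIe without an explicit cudaMemcpy. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t size = 16 << 20;           /* placeholder: 16 MB buffer   */
    void *h_buf, *d_buf;

    cudaSetDeviceFlags(cudaDeviceMapHost);  /* must precede context creation */
    cudaHostAlloc(&h_buf, size, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_buf, h_buf, 0);

    /* The FPGA driver would DMA into h_buf; a kernel launched with d_buf
     * then reads the data directly across the bus. */
    printf("host %p -> device %p\n", h_buf, d_buf);

    cudaFreeHost(h_buf);
    return 0;
}
```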

Let’s hope for a 3.2 with this feature added (please, Nvidia). And perhaps the CUDA 3.2 toolkit and SDK will install without hacks on Ubuntu 10.04, too.

The total latency incurred by the processing pipeline will be greater, but that’s less of a concern. The FPGA device driver’s task is to shift data blocks from the FPGA card to RAM with a time budget of 8.4ms, notify another process that will asynchronously copy the new data block onto the GPU, and immediately start processing the previous block, also with a budget of 8.4ms. My concern is the deterministic execution of the pipeline.
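
The host-side half of such a pipeline might look roughly like the following double-buffered loop; wait_for_fpga_block() and process_block() are placeholders for the real driver notification and kernel launches, not Peter’s actual code:

```c
/* Double-buffered FPGA -> host -> GPU pipeline: while block N is copied to
 * the GPU and processed, the FPGA driver fills block N+1 in pinned memory. */
#include <cuda_runtime.h>

#define BLOCK_BYTES (16u << 20)   /* 16 MB per block, one every 8.4 ms */

/* Placeholders for the real driver notification and processing kernels. */
static void wait_for_fpga_block(void *host_buf) { (void)host_buf; }
static void process_block(void *dev_buf, cudaStream_t s) { (void)dev_buf; (void)s; }

int main(void)
{
    void *host_buf[2], *dev_buf[2];
    cudaStream_t stream[2];

    for (int i = 0; i < 2; i++) {
        cudaMallocHost(&host_buf[i], BLOCK_BYTES);   /* pinned, DMA-able */
        cudaMalloc(&dev_buf[i], BLOCK_BYTES);
        cudaStreamCreate(&stream[i]);
    }

    for (unsigned n = 0; n < 1000; n++) {
        int cur = n & 1;

        wait_for_fpga_block(host_buf[cur]);          /* driver signals block ready */
        cudaMemcpyAsync(dev_buf[cur], host_buf[cur], BLOCK_BYTES,
                        cudaMemcpyHostToDevice, stream[cur]);
        process_block(dev_buf[cur], stream[cur]);

        /* The previous block must be finished before its buffers are reused. */
        cudaStreamSynchronize(stream[cur ^ 1]);
    }
    return 0;
}
```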

Cheers,

peter

We already know that this is an absolutely essential feature for Nvidia in the HPC market; GPU supercomputing clusters were left basically useless until they wrote the Infiniband driver with Mellanox. And people have been talking about it for a while, so hopefully it will be released soon. It’s a question of [a] opening it up for general use and [b] profitability of ‘GPUDirect’. In the first case, there are obvious concerns for abuse and malicious code, and in the second, it’s kind of like how the Quadro/Tesla drivers have special functionality over the Geforce drivers. Speaking of Quadro vs. Geforce, an awesome feature of Cuda 3.1 is the inclusion of PCI bus and device ID in the cuda device caps. Before, it was practically impossible to pair up a Geforce OpenGL context and Cuda context in multi-GPU systems. Now you can do it through NVAPI (Windows) or libXNVCtrl (Linux). Before I had to do it by executing a bunch of stuff through Cuda and monitoring temperatures through libXNVCtrl :P
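
For completeness, reading those PCI IDs from the CUDA 3.1 runtime looks like this (the matching against NVAPI / libXNVCtrl is left out):

```c
/* Enumerate CUDA devices and print their PCI bus/device IDs (new in CUDA 3.1)
 * so each CUDA device can be matched to the GPU seen by NVAPI / libXNVCtrl. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; dev++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("CUDA device %d: %s (PCI %02x:%02x)\n",
               dev, prop.name, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}
```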

Seconded; if you have 1 TFLOP of processing power provided by one Tesla C1060, you can make use of it by giving it a computational task with little external input, such as a numerical simulation. Or you can make use of it by crunching data, in which case you have to get data onto the system. And the data have to come from somewhere, either being produced by the host (unlikely) or by some peripheral device (more likely), in which case transfer from a peripheral device onto the GPU becomes essential. The GPU constitutes one link in a processing chain, and the chain is only as strong as its weakest link; quite likely that weakest link is deterministic and efficient data throughput. I would expect Nvidia drivers to support the concept of data processing pipelines.

Cheers,

peter

Dear all

I also have a real-time data processing application. I typically need to process 200kB of input data every ms, and output the ~4kB result packet to the outside world. I only have ~100us to process the last input bytes, so I really need a pipeline!

Does anybody know how NVIDIA’s digital video pipeline solution works?
(http://www.nvidia.com/object/quadro_dvp.html)

Apparently all data transfers are done through PCIe. It should implement solutions that we could reuse for other RT data processing.

Hello Julien

from what I have seen so far, there are several types of GPU processing applications:

  1. Little input data, lots of output data; e.g. graphics rendering of a physical model

  2. Little input data, little output data; e.g. numerical simulation

  3. Lots of input data, little output data; e.g. data processing (radar, image recognition, etc)

  4. Lots of input data, lots of output data; e.g. video stream processing, digital broadcast, etc.

http://www.nvidia.com/object/quadro_dvp.html seems to be of type 4, and from what I can tell from the website it uses SLI and some proprietary Nvidia communication channels between the capture card, the GPU and the output. I think, though I might be wrong, that this approach does not lend itself to generalisation. Whatever your data source is, it would have to comply with SLI. Most data capture cards are bespoke PCIe cards, most likely with an FPGA on them.

Your application seems to be of type 3. You need to shift 200MB/s onto the GPU, 200kB every ms, presumably in a deterministic fashion, do some processing on these data and generate comparatively little output. I presume that these 200MB/s come from somewhere, too?

It is not quite clear to me what you mean by only having ~100us to process the last input bytes.

Bottom line: Nvidia’s GPUs are a powerful and yet affordable processing platform. Yet it would be nice if Nvidia provided some framework that allowed application developers to use GPUs for real-time data processing by taking into account the entire chain, most importantly the link that gets data from a capture device onto the GPU.

Cheers,

peter
