Real-time GPU processing: peer-to-peer data copy, Linux kernel memory, kernels in the kernel

The Xilinx driver does not provide an ioctl, but I can add one. There are two cases, though:

  1. The ioctl passes a pointer obtained from cudaMallocHost() into the kernel; the ioctl file op then sets off a DMA transfer directly into this area. The only question I have here: how does the kernel obtain the kernel virtual memory address or the physical address from the pointer returned by cudaMallocHost()? Imagine the program that requested the cudaMallocHost() memory dies while the driver is still DMA-ing data into it (see the sketch after this post).
  2. Use a read() system call, DMA into a slab and then copy_to_user(). So one additional memcpy.
    The idea is: let’s have a piece of reserved, contiguous kernel memory and have it used by both CUDA and my FPGA driver. Any cudaMemcpy would then use DMA to the GPU, and I can DMA data from the FPGA into it, too.
    peter
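
As a rough illustration of case #1 (not Peter’s actual driver): one common way to keep the cudaMallocHost() buffer alive during the DMA is to pin its pages in the ioctl handler and translate them to physical addresses. The ioctl structure, struct fpga_dma_req and the descriptor programming below are made-up placeholders, and the get_user_pages_fast() signature varies between kernel versions.

```c
/*
 * Minimal sketch for case #1: pin the user buffer passed in via ioctl and
 * translate it to physical addresses before programming the FPGA DMA engine.
 * Error handling and the actual descriptor writes are omitted.
 */
#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
#include <asm/io.h>

struct fpga_dma_req {
    void __user *buf;   /* pointer returned by cudaMallocHost() */
    size_t       len;
};

static long fpga_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
    struct fpga_dma_req req;
    struct page **pages;
    int i, npages, pinned;

    if (copy_from_user(&req, (void __user *)arg, sizeof(req)))
        return -EFAULT;

    npages = DIV_ROUND_UP(req.len, PAGE_SIZE);
    pages = kcalloc(npages, sizeof(*pages), GFP_KERNEL);
    if (!pages)
        return -ENOMEM;

    /* Pin the user pages so they stay resident while the FPGA writes to them.
     * (Signature of get_user_pages_fast() differs between kernel versions.) */
    pinned = get_user_pages_fast((unsigned long)req.buf, npages, 1, pages);
    if (pinned < npages)
        goto out_release;

    for (i = 0; i < npages; i++) {
        phys_addr_t phys = page_to_phys(pages[i]);
        /* program one scatter/gather descriptor of the FPGA DMA engine with
         * 'phys' here */
    }
    /* start the transfer; put_page() each page once the DMA has completed */
    kfree(pages);
    return 0;

out_release:
    for (i = 0; i < pinned; i++)
        put_page(pages[i]);
    kfree(pages);
    return -EFAULT;
}
```

Pinning the pages also means the buffer cannot be released out from under the FPGA if the process dies before put_page() is called, which addresses the crash scenario above.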

Peter, so far I have gone down the path of use case #1. It seems like what we are trying to achieve is quite similar, e.g. pipelining FPGA → GPU with the lowest latency possible. In fact, I’ve even heard of some people in this region that are starting to use Nvidia GPUs for radar processing ;)

I am also using a Xilinx V5. I ioctl the buffers allocated through cuMemHostAlloc() to our driver, look up the vma through find_vma(), and calculate the physical address from that. The problem, like you pointed out, is that if the app crashes and the NV driver removes the mapping while the FPGA is still writing to it, boom, the machine’s dead. I’m going to try adding references to the vm_area_struct so hopefully the kernel will keep the mapping around even after NV discards it.

I was hoping that there would be some improvements in Cuda 3.1 for doing this kind of thing, the release notes even made a big point about it, alas there is nothing new in the toolkit.

You mentioned earlier that you could modify the low-level NV driver to accomplish this; that would be approaching this problem from the other end. I have implemented some changes in their allocator, but got this way up and running first. It was easier to pass memory down into a driver you have the source to than to try to pass it up from the depths of the Nvidia driver and out through Cuda.

Well, that’s our plan, ~2GB/s in chunks of 16MB every 8.4ms.

It is certainly one possibility, but it strikes me as error prone if a user space program allocates something a kernel driver is then relying on. It should really be the other way round.

From a business perspective, it is not desirable that an application relies on modifications to NVidia’s driver. Not only would any warranty be waived, but the same patch would also have to be applied and tested for every CUDA release.

The perfect compromise would be if Nvidia offered something like

cudaMallocHost(void **ptr_out, void *ptr_in, size_t size)

ptr_in could be a physical address of reserved memory used by the driver. Alternatively, the device driver allocates some contiguous kernel memory, a user space application mmaps it into user virtual memory and passes it on to cudaMallocHost, which would simply remap it into the driver’s kernel virtual memory space.

The result is that several parties access the same memory window, the device driver, the nvidia driver and a user space program. The user will have to make sure that they do not step on each other’s feet.
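
As a sketch of the second variant (driver-owned memory exported via mmap; all names and the 16 MB size are illustrative, and CUDA currently offers no call to adopt such a mapping):

```c
/*
 * Sketch: the FPGA driver owns a contiguous DMA buffer and exports it to
 * user space via mmap(). fpga_mmap and dma_buf_size are illustrative names.
 */
#include <linux/dma-mapping.h>
#include <linux/fs.h>
#include <linux/mm.h>

static void       *dma_vaddr;   /* kernel virtual address of the buffer  */
static dma_addr_t  dma_handle;  /* bus address programmed into the FPGA  */
static const size_t dma_buf_size = 16 << 20;   /* one 16 MB block        */

/* Called once at probe time with the FPGA's struct device *dev:
 *     dma_vaddr = dma_alloc_coherent(dev, dma_buf_size, &dma_handle, GFP_KERNEL);
 */

static int fpga_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;

    if (size > dma_buf_size)
        return -EINVAL;

    /* Map the driver-owned buffer into the calling process. */
    return remap_pfn_range(vma, vma->vm_start, dma_handle >> PAGE_SHIFT,
                           size, vma->vm_page_prot);
}
```

User space would then mmap() the device file and, ideally, hand the resulting pointer to a cudaMallocHost-like registration call, if Nvidia ever exposes one.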

Cheers,

peter

Yes, a cudaMallocHost() that takes in physical pointers would be the ideal solution, as you have said. There was talk of this feature “coming soon” here on the forums more than a year ago, after which any reference to it promptly disappeared. It looks like they planned to expose an API like this in the 3.1 release (GPUDirect), but there must have been some kind of mix-up because there isn’t anything new in the toolkit or the drivers.

I found that my system was unstable because I was doing the virtual → physical translation incorrectly. Previously, I had used find_vma() and calculated the physical address as an offset from that. I found the correct way to do it in nv.c in the driver, in the function nv_kern_get_physical() - you have to walk the page-tables. Now everything runs smoothly, even if the process crashes (as long as I stop the FPGA shortly thereafter - the kernel will close the dev file for you). I also find patching the Nvidia driver to be undesirable in many aspects and am glad that this approach works. For my experiments though, it has been very useful (mostly enabling debugging and adding printk’s).
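
For reference, the page-table walk described above looks roughly like this on a 2.6-era kernel (the helper name user_virt_to_phys is mine; huge pages, locking and error handling are ignored):

```c
/* Translate a user virtual address to a physical address by walking the
 * four-level page tables of the given mm. Returns 0 if not mapped. */
#include <linux/mm.h>
#include <asm/pgtable.h>

static unsigned long user_virt_to_phys(struct mm_struct *mm, unsigned long vaddr)
{
    pgd_t *pgd;
    pud_t *pud;
    pmd_t *pmd;
    pte_t *pte;
    unsigned long phys;

    pgd = pgd_offset(mm, vaddr);
    if (pgd_none(*pgd) || pgd_bad(*pgd))
        return 0;
    pud = pud_offset(pgd, vaddr);
    if (pud_none(*pud) || pud_bad(*pud))
        return 0;
    pmd = pmd_offset(pud, vaddr);
    if (pmd_none(*pmd) || pmd_bad(*pmd))
        return 0;
    pte = pte_offset_map(pmd, vaddr);
    if (!pte_present(*pte)) {
        pte_unmap(pte);
        return 0;
    }
    /* Physical page frame plus the offset within the page. */
    phys = (pte_pfn(*pte) << PAGE_SHIFT) | (vaddr & ~PAGE_MASK);
    pte_unmap(pte);
    return phys;
}
```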

8.4ms is ambitious, good luck! In my experience, the latency requirements for audio (1-5ms) < ladar (10-20ms) < video (30ms), all of which are now possible with ZeroCopy. Nvidia will maintain that direct GPU communication is wholly unnecessary, but I know they plan on this at some point and are just taking it slow at the moment.
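
For anyone unfamiliar with the ZeroCopy path mentioned above: since CUDA 2.2 a pinned host buffer can be mapped into the device address space so kernels read it directly over PCIe. A minimal example (the 16 MB size is just a placeholder):

```c
/* Zero-copy: map pinned host memory into the device address space so a
 * kernel can read it directly over PCIe without an explicit cudaMemcpy. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t size = 16 << 20;           /* placeholder: 16 MB buffer   */
    void *h_buf, *d_buf;

    cudaSetDeviceFlags(cudaDeviceMapHost);  /* must precede context creation */
    cudaHostAlloc(&h_buf, size, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_buf, h_buf, 0);

    /* The FPGA driver would DMA into h_buf; a kernel launched with d_buf
     * then reads the data directly across the bus. */
    printf("host %p -> device %p\n", h_buf, d_buf);

    cudaFreeHost(h_buf);
    return 0;
}
```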

Let’s hope for a 3.2 with this feature added (please, Nvidia). And perhaps the CUDA 3.2 toolkit and SDK will install without hacks on Ubuntu 10.04, too.

The total latency incurred by the processing pipeline will be greater, but that’s less of a concern. The FPGA device driver’s task is to shift data blocks from the FPGA card to RAM with a time budget of 8.4ms, notify another process that will asynchronously copy the new data block onto the GPU, and immediately start processing the previous block, also with a budget of 8.4ms. My concern is the deterministic execution of the pipeline.
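
The host-side half of such a pipeline might look roughly like the following double-buffered loop; wait_for_fpga_block() and process_block() are placeholders for the real driver notification and kernel launches, not Peter’s actual code:

```c
/* Double-buffered FPGA -> host -> GPU pipeline: while block N is copied to
 * the GPU and processed, the FPGA driver fills block N+1 in pinned memory. */
#include <cuda_runtime.h>

#define BLOCK_BYTES (16u << 20)   /* 16 MB per block, one every 8.4 ms */

/* Placeholders for the real driver notification and processing kernels. */
static void wait_for_fpga_block(void *host_buf) { (void)host_buf; }
static void process_block(void *dev_buf, cudaStream_t s) { (void)dev_buf; (void)s; }

int main(void)
{
    void *host_buf[2], *dev_buf[2];
    cudaStream_t stream[2];

    for (int i = 0; i < 2; i++) {
        cudaMallocHost(&host_buf[i], BLOCK_BYTES);   /* pinned, DMA-able */
        cudaMalloc(&dev_buf[i], BLOCK_BYTES);
        cudaStreamCreate(&stream[i]);
    }

    for (unsigned n = 0; n < 1000; n++) {
        int cur = n & 1;

        wait_for_fpga_block(host_buf[cur]);          /* driver signals block ready */
        cudaMemcpyAsync(dev_buf[cur], host_buf[cur], BLOCK_BYTES,
                        cudaMemcpyHostToDevice, stream[cur]);
        process_block(dev_buf[cur], stream[cur]);

        /* The previous block must be finished before its buffers are reused. */
        cudaStreamSynchronize(stream[cur ^ 1]);
    }
    return 0;
}
```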

Cheers,

peter

We already know that this is an absolutely essential feature for Nvidia in the HPC market; GPU supercomputing clusters were left basically useless until they wrote the Infiniband driver with Mellanox. And people have been talking about it for a while, so hopefully it will be released soon. It’s a question of [a] opening it up for general use and [b] profitability of ‘GPUDirect’. In the first case, there are obvious concerns for abuse and malicious code, and in the second, it’s kind of like how the Quadro/Tesla drivers have special functionality over the Geforce drivers. Speaking of Quadro vs. Geforce, an awesome feature of Cuda 3.1 is the inclusion of PCI bus and device ID in the cuda device caps. Before, it was practically impossible to pair up a Geforce OpenGL context and Cuda context in multi-GPU systems. Now you can do it through NVAPI (Windows) or libXNVCtrl (Linux). Before I had to do it by executing a bunch of stuff through Cuda and monitoring temperatures through libXNVCtrl :P
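
For completeness, reading those PCI IDs from the CUDA 3.1 runtime looks like this (the matching against NVAPI / libXNVCtrl is left out):

```c
/* Enumerate CUDA devices and print their PCI bus/device IDs (new in CUDA 3.1)
 * so each CUDA device can be matched to the GPU seen by NVAPI / libXNVCtrl. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; dev++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("CUDA device %d: %s (PCI %02x:%02x)\n",
               dev, prop.name, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}
```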

Seconded; if you have 1 TFLOP of processing power provided by one Tesla C1060, you can make use of it by giving it a computational task with little external input, such as a numerical simulation. Or you can make use of it by crunching data, in which case you have to get data onto the system. And the data have to come from somewhere, either being produced by the host (unlikely) or by some peripheral device (more likely), in which case transfer from a peripheral device onto the GPU becomes essential. The GPU constitutes one link in a processing chain, and the chain is only as strong as its weakest link; quite likely that weakest link is deterministic and efficient data throughput. I would expect Nvidia drivers to support the concept of data processing pipelines.

Cheers,

peter

Dear all

I also have a real-time data processing application. I typically need to process 200kB of input data every ms, and output the ~4kB result packet to the outside world. I only have ~100us to process the last input bytes, so I really need a pipeline!

Does anybody know how NVIDIA’s digital video pipeline solution works?
(http://www.nvidia.com/object/quadro_dvp.html)

Apparently all data transfers are done through PCIe. It should implement solutions that we could reuse for other RT data processing.

Hello Julien

from what I have seen so far, there are several types of GPU processing applications:

  1. Little input data, lots of output data; e.g. graphics rendering of a physical model

  2. Little input data, little output data; e.g. numerical simulation

  3. Lots of input data, little output data; e.g. data processing (radar, image recognition, etc)

  4. Lots of input data, lots of output data; e.g. video stream processing, digital broadcast, etc.

http://www.nvidia.com/object/quadro_dvp.html seems to be of type 4, and from what I can tell from the website it uses SLI and some proprietary Nvidia communication channels between the capture card, the GPU and the output. I think, though I might be wrong, that this approach does not lend itself to generalisation. Whatever your data source is, it would have to comply with SLI. Most data capture cards are bespoke PCIe cards, most likely with an FPGA on them.

Your application seems to be of type 3. You need to shift 200MB/s onto the GPU, 200kB every ms, presumably in a deterministic fashion, do some processing on these data and generate comparatively little output. I presume that these 200MB/s come from somewhere, too?

It is not quite clear to me what you mean by only having ~100us to process the last input bytes.

Bottom line: Nvidia’s GPUs are a powerful and yet affordable processing platform. Yet it would be nice if Nvidia provided some framework that allowed application developers to use GPUs for real-time data processing by taking into account the entire chain, most importantly the link that gets data from a capture device onto the GPU.

Cheers,

peter
