Copying data from CUDA to the host is very slow. Is there any fast way to transfer data between CUDA and the OpenMAX codec directly, without going through host memory first?
I’m not sure about your use case, but maybe you can try the GStreamer plugins with CUDA support to see if any of them helps.
Regarding the GStreamer plug-ins gst-videocuda and gst-nvivafilter for CUDA post-processing: the L4T R23.1 public release provides a gst-install script to install GStreamer 1.6.x, and CUDA post-processing support through the “gst-videocuda & gst-nvivafilter” plugins is enabled with gst-1.6 on R23.1. Please refer to the “CUDA VIDEO POST_PROCESS WITH GSTREAMER-1.0” section of the Multimedia User Guide for reference gst-launch-1.0 pipelines, which are also applicable to gst-1.6.
My use case: first OMX JPEG decode, then CUDA processing, then OMX H.264 encode.
I don’t want to use GStreamer; I want to use the API directly.
Also, I can’t find the source code of gst-videocuda & gst-nvivafilter.
GStreamer is the only official API for this; it is implemented on top of OMX.
The Multimedia User Guide can be downloaded from the link below:
Please refer to the “GSTREAMER BUILD INSTRUCTIONS” section to install GStreamer 1.6.x and enable CUDA post-processing support through the “gst-videocuda & gst-nvivafilter” plugins.
You can also refer to the “CUDA VIDEO POST_PROCESS WITH GSTREAMER-1.0” section to see how to use CUDA post-processing with these plugins.
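For reference, a pipeline of the shape that section describes (and matching the JPEG decode → CUDA → H.264 encode use case above) looks roughly like the sketch below. This is illustrative only: the element names (nvjpegdec, nvivafilter, omxh264enc), the cuda-process and customer-lib-name properties, and the caps strings are assumptions that should be checked against the MM user guide for your L4T release.

```shell
# Illustrative nvivafilter pipeline: OMX JPEG decode -> CUDA post-process
# -> OMX H.264 encode. Element and property names are assumptions to be
# verified against the Multimedia User Guide for your release.
PIPELINE='filesrc location=in.jpg ! nvjpegdec ! "video/x-raw(memory:NVMM)" ! nvivafilter cuda-process=true customer-lib-name=libnvsample_cudaprocess.so ! "video/x-raw(memory:NVMM), format=(string)NV12" ! omxh264enc ! filesink location=out.h264'

# On the target this would be launched as:
echo "gst-launch-1.0 $PIPELINE"
```

The custom CUDA library is loaded at run time via the customer-lib-name property, so swapping in your own processing only means pointing that property at a different .so.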
I am also interested in GStreamer/CUDA processing: camera --> CUDA processing --> H.264 encode. I checked the Multimedia User Guide and saw the pipeline containing the nvivafilter plugin, but I did not find anything about how to build the actual CUDA library used by the filter (libsample_process.so in the guide). Maybe it is a dumb question, but is there any example or documentation about this?
You should be able to find libsample_process.so in /usr/lib/arm-linux-gnueabihf/.
Please follow the instructions to install GStreamer 1.6.x and enable CUDA post-processing support through the “gst-videocuda & gst-nvivafilter” plugins.
Also make sure the CUDA toolkit is installed.
I have GStreamer 1.6.0 and CUDA Toolkit 7.0 installed.
There is a libsample_process.so in /usr/lib/arm-linux-gnueabihf/, but I haven’t found any information (sample or documentation) in the toolkit showing how to create this .so file for use with nvivafilter, and I haven’t found the source code of libsample_process.so.
The sample CUDA processing sources are not public yet due to IP concerns; once the relevant issues are settled, we will provide a link to the reference sources.
Please stay tuned.
I would like to know whether I can use this sample with the Jetson Tegra K1.
No. This feature is only for TX1.
The latest L4T R24.2 public release at https://developer.nvidia.com/embedded/linux-tegra provides the sources of the libsample_process.so library.
You need to download nvsample_cudaprocess_src.tbz2 from the source package link on the R24.2 release page.
Please refer to nvsample_cudaprocess_README.txt for details of the interface APIs.
The source package also provides a Makefile and instructions for on-target compilation.
The reference CUDA sample implementation can be replaced with any custom CUDA operation.
I hope this will help with the target use cases.
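For anyone following along, the download-and-build flow described above can be sketched as below. It is shown as a dry run that only prints each step; the actual archive layout, Makefile targets, and output-library name come from nvsample_cudaprocess_README.txt, so treat the paths here as assumptions.

```shell
# Dry run of a hypothetical on-target build flow for the R24.2 CUDA sample
# source. Each step is printed rather than executed; verify paths and
# targets against nvsample_cudaprocess_README.txt before running for real.
STEPS='tar xjf nvsample_cudaprocess_src.tbz2
make -C nvsample_cudaprocess
sudo cp nvsample_cudaprocess/libnvsample_cudaprocess.so /usr/lib/arm-linux-gnueabihf/'
printf '%s\n' "$STEPS"
```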
As I understand the provided sample code, it receives an EGL input image, does its processing, and writes the result back into the same EGL image.
This clearly works for the provided functionality, but for several image-processing operations (e.g. convolution-based filtering) this destructive, in-place behaviour is not possible, so I need separate buffers for the input and output of the CUDA kernel.
The simplest (but not really efficient) way would be to allocate a temporary device buffer, copy the input EGL image data into it, and then use that device memory as the input to the CUDA kernel, which overwrites the data in the EGL image. Unfortunately, I was not able to allocate device memory: in the init function I get an “invalid context” error, and in the gpu_processing function I get “invalid size” (the latter is strange, since the size is fine). I tried a plain device-memory allocation as well as a host allocation followed by getting the device pointer, storing the device pointers in static variables. The host allocation worked, but getting the device pointer failed.
So my questions:
- Is there a way to somehow allocate device memory?
- Would it be possible to have separate input and output EGL buffers in nvivafilter? That way the memory copy could be omitted.
I am still unable to allocate any type of device memory in the provided nvsample_cudaprocess project. As a last resort I also tried registering the input EGL image twice (I know, dumb…), but although eglFrame.frame.pPitch is different, it still points to the same memory. So no luck there.
I have the feeling that I am doing something wrong, but I have run out of ideas. Has anyone had success with real image-processing tasks done in GStreamer?
kayccc or apandya - can you help me with this issue?
As I guessed, I had messed something up; allocating device memory now works. That is:
- Allocate a large enough device buffer on the first frame. cuMemAllocManaged seems to be the fastest.
- On every frame:
  - Copy the EGL frame data into the allocated device memory.
  - Run the kernel; the input is the allocated device memory, the output is the EGL frame.
This method works, but it requires a memcpy that wouldn’t be necessary if the plugin’s input and output EGL frames were allocated separately.
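To make the steps above concrete, here is a sketch of what that per-frame path can look like inside a nvsample_cudaprocess-style gpu_process hook. The EGL mapping calls (cuGraphicsEGLRegisterImage, cuGraphicsResourceGetMappedEglFrame) follow the pattern used by the R24.2 sample, but the function signature, the blur kernel, and the buffer handling are my own illustrative assumptions, not the shipped code:

```cuda
// Sketch only: assumes a nvsample_cudaprocess-style entry point; error
// handling is omitted and all names below are illustrative assumptions.
#include <cuda.h>
#include <cudaEGL.h>
#include <EGL/egl.h>
#include <EGL/eglext.h>

// Example non-destructive op: 3-tap horizontal box blur on the luma plane,
// reading from the scratch copy and writing into the mapped EGL frame.
__global__ void blur3(const unsigned char *src, unsigned char *dst,
                      int width, int height, int pitch)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || x >= width - 1 || y >= height) return;
    int i = y * pitch + x;
    dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3;
}

static CUdeviceptr scratch = 0;   // allocated on the first frame, then reused

extern "C" void gpu_process(EGLImageKHR image, void **userPtr)
{
    CUgraphicsResource res;
    CUeglFrame eglFrame;

    // Map the NVMM buffer handed over by nvivafilter into CUDA.
    cuGraphicsEGLRegisterImage(&res, image, CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);
    cuGraphicsResourceGetMappedEglFrame(&eglFrame, res, 0, 0);

    size_t lumaBytes = (size_t)eglFrame.pitch * eglFrame.height;
    if (!scratch)  // first frame: allocate a large enough device buffer
        cuMemAllocManaged(&scratch, lumaBytes, CU_MEM_ATTACH_GLOBAL);

    // Separate input and output: copy the frame into the scratch buffer,
    // then let the kernel write back into the EGL frame (this is the extra
    // memcpy discussed above).
    cuMemcpyDtoD(scratch, (CUdeviceptr)eglFrame.frame.pPitch[0], lumaBytes);

    dim3 block(16, 16);
    dim3 grid((eglFrame.width + 15) / 16, (eglFrame.height + 15) / 16);
    blur3<<<grid, block>>>((const unsigned char *)scratch,
                           (unsigned char *)eglFrame.frame.pPitch[0],
                           eglFrame.width, eglFrame.height, eglFrame.pitch);
    cudaDeviceSynchronize();

    cuGraphicsUnregisterResource(res);
}
```

The “invalid context” error mentioned earlier fits this picture: a CUDA context is only guaranteed to be current inside the per-frame callback, which is why the allocation here happens lazily on the first frame rather than in init.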
You may find the following information about the GstCUDA framework interesting; I think it is exactly what you are looking for. Below you will find a more detailed description, but in summary, it is a framework that makes it easy to interface GStreamer with CUDA optimally, guaranteeing zero memory copies. It also supports several inputs.
GstCUDA is a RidgeRun developed GStreamer plug-in enabling easy CUDA algorithm integration into GStreamer pipelines. GstCUDA offers a framework that allows users to develop custom GStreamer elements that execute any CUDA algorithm. The GstCUDA framework is a series of base classes abstracting the complexity of both CUDA and GStreamer. With GstCUDA, developers avoid writing elements from scratch, allowing the developer to focus on the algorithm logic, thus accelerating time to market.
GstCUDA offers a GStreamer plugin that contains a set of elements ideal for quick GStreamer/CUDA prototyping. Those elements consist of a set of filters with different input/output pad combinations that are run-time loadable with an external custom CUDA library containing the algorithm to be executed on the GPU for each video frame that passes through the pipeline. The GstCUDA plugin lets users develop their own CUDA processing library, pass it to the GstCUDA filter element that best fits the algorithm’s requirements, and execute it on the GPU, with upstream frames from the GStreamer pipeline passed to the GPU and the modified frames passed downstream to the next element in the pipeline. Those elements were created with the CUDA algorithm developer in mind: they support quick prototyping and abstract away all GStreamer concepts. The elements are fully adaptable to different project needs, making GstCUDA a powerful tool that is essential for CUDA/GStreamer project development.
One remarkable feature of GstCUDA is that it provides a zero-memory-copy interface between CUDA and GStreamer on Jetson TX1/TX2 platforms. This enables heavy algorithms and large amounts of data (up to 2x 4K 60fps streams) to be processed in CUDA without the performance hit caused by copies or memory conversions. GstCUDA provides the necessary APIs to directly handle NVMM buffers and achieve the best possible performance on Jetson TX1/TX2 platforms. It provides a series of base classes and utilities that abstract the complexity of handling the memory interface between GStreamer and CUDA, so the developer can focus on what actually adds value to the end product. GstCUDA ensures optimal performance for GStreamer/CUDA applications on Jetson platforms.
You can find detailed information about GstCUDA at the following link:
I hope this information can be useful to you.