OpenCV Performance TK1

Hi Everyone,

Using the Jetson TK1 and its custom NVIDIA build of the OpenCV library, I’ve written a piece of code that demosaics a Bayer image and resizes it, using two methods:

  1. Regular cv::gpu::GpuMat
  2. cv::gpu::CudaMem with ALLOC_ZEROCOPY

I would expect the second method to be much faster, since it makes use of the unified memory architecture in the TK1, but it’s 35% slower! Can anyone explain why that is? Also, when I disable both the resize and the demosaic, the second method performs faster.
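
Roughly, the two paths look like this (a sketch of the idea, not my actual 58-line test; the Bayer code, scale factor and interpolation here are placeholders):

```cpp
#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>

// Variant 1: regular GpuMat with an explicit host->device upload.
void demosaic_resize_copied(const cv::Mat& bayer)
{
    cv::gpu::GpuMat d_bayer, d_bgr, d_small;
    d_bayer.upload(bayer);                                   // explicit copy
    cv::gpu::cvtColor(d_bayer, d_bgr, CV_BayerBG2BGR);       // demosaic
    cv::gpu::resize(d_bgr, d_small, cv::Size(), 0.5, 0.5);   // downscale
}

// Variant 2: CudaMem with ALLOC_ZEROCOPY, mapped into the GPU address space.
void demosaic_resize_zerocopy(const cv::Mat& bayer)
{
    cv::gpu::CudaMem mapped(bayer.rows, bayer.cols, bayer.type(),
                            cv::gpu::CudaMem::ALLOC_ZEROCOPY);
    cv::Mat h_view = mapped.createMatHeader();               // CPU view of mapped memory
    bayer.copyTo(h_view);
    cv::gpu::GpuMat d_bayer = mapped.createGpuMatHeader();   // GPU view, no upload
    cv::gpu::GpuMat d_bgr, d_small;
    cv::gpu::cvtColor(d_bayer, d_bgr, CV_BayerBG2BGR);
    cv::gpu::resize(d_bgr, d_small, cv::Size(), 0.5, 0.5);
}
```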

The example is 58 lines; see GitHub - Error323/gpumat-tk1: performance test unified memory.

Are you sure that ALLOC_ZEROCOPY is supported by Tegra TK1?

From the OpenCV docs:
ALLOC_ZEROCOPY specifies a zero copy memory allocation that enables mapping the host memory to GPU address space, if supported.

Yes, I’m sure. From Data Structures — OpenCV 2.4.13.7 documentation you can see the function canMapHostMemory(), which returns true on the TK1. Furthermore, CudaMem::createGpuMatHeader() would not work if it didn’t.
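
A quick way to check this on the device itself (a minimal sketch using the OpenCV 2.4 gpu API):

```cpp
#include <cstdio>
#include <opencv2/gpu/gpu.hpp>

int main()
{
    cv::gpu::DeviceInfo info;  // queries the current CUDA device (0 by default)
    std::printf("canMapHostMemory: %s\n",
                info.canMapHostMemory() ? "yes" : "no");  // prints "yes" on the TK1
    return 0;
}
```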

But yeah, I really don’t understand why its performance is worse. It should be better, since less copying is going on.

Could some of the people here please verify these results?

Did you max out the CPU and GPU clocks before running the performance tests? By default they are adjusted dynamically based on load, and on short tests they can be basically anything between 100 MHz and 2000 MHz.

Edit: Link to the performance wiki page: http://elinux.org/Jetson/Performance

Yes, I did indeed. It doesn’t matter here though, since I’m referring to the relative difference between a regular GpuMat and CudaMem with zero-copy.

Here you can see a comparison of shared vs. copied memory on the TK1. I perform a demosaic of a Bayer pattern and a resize at the following resolutions:

100 500 1000 1500 2000 2500 3000 3500 4000

[Chart: shared vs. copied timing at each resolution]

I’ll give this a final bump. Any NVIDIA devs out there who can confirm and/or explain what’s going on?

Reopening this, as none of the channels I’ve tried has yielded an answer yet. Could anyone please verify and elaborate?

Hi Error323,

NVIDIA is considering posting a more detailed answer, but for now I can give you some hints.

“Zero-copy” on Tegra K1 (UVM-Lite) makes some CUDA kernels faster and some kernels slower. My guess is that regular color conversions are typically faster with zero-copy, whereas Bayer formats would be slower, since they access pixels in an irregular pattern.

Zero-copy removes the delay of transferring memory between CPU & GPU, but in Tegra K1 the zero-copy memory won’t be cached as well as regular GPU memory. So zero-copy is more likely to be faster in small simple kernels that don’t access the same group of pixels more than once, while the traditional method is more likely to be faster in large complex kernels that access the same pixels many times.
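
For the curious, this is roughly what a zero-copy allocation looks like at the CUDA runtime level (a sketch; buffer size and error handling are omitted):

```cpp
#include <cuda_runtime.h>

void zero_copy_sketch(size_t bytes)
{
    // Must be set before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    void* h_ptr = NULL;
    cudaHostAlloc(&h_ptr, bytes, cudaHostAllocMapped);  // mapped, page-locked host memory

    void* d_ptr = NULL;
    cudaHostGetDevicePointer(&d_ptr, h_ptr, 0);         // GPU alias of the same memory

    // A kernel launched with d_ptr reads the host allocation directly: no
    // cudaMemcpy, but on Tegra K1 these reads bypass the GPU cache hierarchy,
    // so every repeated access to the same pixel pays the full latency again.

    cudaFreeHost(h_ptr);
}
```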

I don’t have much experience with zero-copy myself, but here are some notes that might help tweak the memory performance for Tegra K1.

Using UVM-Lite on Tegra K1 (i.e., allocating memory using cudaMallocManaged() from the UVM-Lite API, to get a memory pointer that works on both the CPU & GPU):

  • Don’t modify the same memory on the CPU & GPU at the same time. One potential strategy is to copy data into a second buffer: while the CPU is processing one buffer, have the GPU process the other, so they can run in parallel (see the sketch after this list).
  • Launching a CUDA kernel will flush ALL caches used by both the CPU and GPU.
  • Try using pinned memory pages instead of regular (pageable) memory, as this is often faster on Tegra K1.
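
A minimal sketch of that double-buffering idea with managed memory (the kernel, sizes, and the use of the cudaMemAttachHost flag are my own illustrative choices):

```cpp
#include <cuda_runtime.h>

__global__ void process(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // placeholder GPU work
}

void pipeline(int n)
{
    float *bufA, *bufB;
    cudaMallocManaged(&bufA, n * sizeof(float));                     // GPU-side buffer
    cudaMallocManaged(&bufB, n * sizeof(float), cudaMemAttachHost);  // host-attached, so
                                                                     // the CPU may touch it
                                                                     // while a kernel runs
    process<<<(n + 255) / 256, 256>>>(bufA, n);   // GPU processes bufA...
    for (int i = 0; i < n; ++i)                   // ...while the CPU fills bufB in parallel
        bufB[i] = static_cast<float>(i);
    cudaDeviceSynchronize();                      // required before the CPU may touch bufA

    cudaFree(bufA);
    cudaFree(bufB);
}
```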

I haven’t done a detailed empirical study, but in my experience it’s faster to use pinned memory at the input & output of your processing chain, and for all intermediate stages to use buffers allocated via cudaMalloc().
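
In other words, something like this layout (a sketch with placeholder kernels and sizes):

```cpp
#include <cuda_runtime.h>

void processing_chain(size_t bytes)
{
    unsigned char *h_in, *h_out;                  // pinned host buffers at the chain ends
    cudaHostAlloc(&h_in,  bytes, cudaHostAllocDefault);
    cudaHostAlloc(&h_out, bytes, cudaHostAllocDefault);

    unsigned char *d_a, *d_b;                     // intermediates in plain device memory
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);

    cudaMemcpy(d_a, h_in, bytes, cudaMemcpyHostToDevice);   // fast from pinned memory
    // ... kernel1(d_a -> d_b); kernel2(d_b -> d_a); ...    // fully cached device memory
    cudaMemcpy(h_out, d_a, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
}
```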

So far, managed memory seems good for convenience rather than raw performance. Perhaps this will change, or perhaps it’s not true for certain use cases.

Hi ShervinE,

Sorry for my late response; I only saw your reply recently.

I was under the impression that the TK1 does not have separate GPU memory. As such, I don’t see why memory from cudaMallocManaged() wouldn’t be cached as well. Can you elaborate?

Thank you very much for your reply. I’ll revisit my code given your suggestions, and I look forward to the more detailed answer from NVIDIA!

Error323

There’s no separate GPU memory, but there’s a relevant quote in Shervin’s other post.

I’m guessing that means there is a performance penalty when you access the same memory location often from both the CPU and GPU.

One thing to note is that zero-copy does not automatically mean peak performance; memory access patterns can negate any advantage of zero-copy.
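
As a toy illustration of the access-pattern point (made-up kernels, not from any of the tests above): the first kernel touches each mapped element exactly once, which suits zero-copy; the second re-reads neighbors, which is cheap from cached device memory but expensive from an uncached mapped buffer.

```cpp
// Zero-copy friendly: one coalesced pass, each input element read once.
__global__ void scale_once(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Zero-copy hostile: each input element is read ~3 times (like the
// neighborhood taps of a demosaic or blur), costly when reads are uncached.
__global__ void box3(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}
```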

I think it would be useful to allow creating an OpenCV GpuMat from an existing GPU pointer.

I am writing a post on using UVA with CUDA and gpu::functions in OpenCV.

I will have my examples up soon on using OpenCV GpuMat without upload() or download().

Hope it is helpful.
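
In the meantime, a minimal sketch of the idea: OpenCV 2.4’s GpuMat has a constructor that can wrap an existing device pointer directly, so no upload()/download() is involved (the pitched allocation and CV_8UC4 type here are illustrative):

```cpp
#include <cuda_runtime.h>
#include <opencv2/gpu/gpu.hpp>

cv::gpu::GpuMat wrap_device_buffer(int rows, int cols)
{
    void*  d_ptr = NULL;
    size_t pitch = 0;
    cudaMallocPitch(&d_ptr, &pitch, cols * 4, rows);  // raw CUDA buffer, 4 bytes/pixel

    // Header over memory OpenCV did not allocate; OpenCV will not free it,
    // so the caller keeps ownership of d_ptr.
    return cv::gpu::GpuMat(rows, cols, CV_8UC4, d_ptr, pitch);
}
```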

In case it’s of interest, I have created an application to test a simple OpenCV pipeline on the CPU and GPU using the different memory transfer methods.

It is available on GitHub:

On Jetson TK1 I get the following result:

opencv_cpu_vs_gpu 
===================

The test will be performed by averaging the timing of 100 iterations.

Testing performances of CPU...

 Memory 	 Total: 2278.68 msec 	 Mean: 22.7868 msec
 Resize 	 Total: 1442.41 msec 	 Mean: 14.4241 msec
 RGB2Gray 	 Total: 137.998 msec 	 Mean: 1.37998 msec
 Blur 		 Total: 186.578 msec 	 Mean: 1.86578 msec
 Canny 		 Total: 1788.31 msec 	 Mean: 17.8831 msec
---------------------------------------------------------------
Process 	 Total: 5835.49 msec 	 Mean: 58.3549 msec

Testing performances GPU with memory copy...

 Memory 	 Total: 3914.14 msec 	 Mean: 39.1414 msec
 Resize 	 Total: 528.038 msec 	 Mean: 5.28038 msec
 RGB2Gray 	 Total: 149.855 msec 	 Mean: 1.49855 msec
 Blur 		 Total: 1515.34 msec 	 Mean: 15.1534 msec
 Canny 		 Total: 1613.99 msec 	 Mean: 16.1399 msec
---------------------------------------------------------------
Process 	 Total: 7723.17 msec 	 Mean: 77.2317 msec

Testing performances of GPU with ZEROCOPY...

 Memory 	 Total: 3708.36 msec 	 Mean: 37.0836 msec
 Resize 	 Total: 3228.88 msec 	 Mean: 32.2888 msec
 RGB2Gray 	 Total: 135.844 msec 	 Mean: 1.35844 msec
 Blur 		 Total: 115.621 msec 	 Mean: 1.15621 msec
 Canny 		 Total: 1630.7 msec 	 Mean: 16.307 msec
---------------------------------------------------------------
Process 	 Total: 8820.61 msec 	 Mean: 88.2061 msec

Testing performances of GPU with Memory Managed...

 Memory 	 Total: 2720.48 msec 	 Mean: 27.2048 msec
 Resize 	 Total: 604.373 msec 	 Mean: 6.04373 msec
 RGB2Gray 	 Total: 202.85 msec 	 Mean: 2.0285 msec
 Blur 		 Total: 188.154 msec 	 Mean: 1.88154 msec
 Canny 		 Total: 1966.76 msec 	 Mean: 19.6676 msec
---------------------------------------------------------------
Process 	 Total: 5936.44 msec 	 Mean: 59.3644 msec

“Memory” is the part of the pipeline related to memory transfer: memcpy for the CPU and managed-memory tests, upload for the GPU memory-copy test, CudaMem for ZEROCOPY, and so on.

I’m going to run it on the TX1 and TX2 to better understand the best way to use OpenCV on Jetson boards.