Using the Jetson TK1 and its custom NVIDIA-built OpenCV library, I've written a piece of code that demosaics a Bayer image and resizes it, using two methods:
1. ordinary device memory, with an explicit upload to a cv::gpu::GpuMat
2. cv::gpu::CudaMem with ALLOC_ZEROCOPY
I would expect the second method to be much faster, since it makes use of the unified memory architecture in the TK1, but it's 35% worse! Can anyone explain why that is? Also, when I disable both the resize and the demosaic, the second method performs faster.
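For reference, here's a rough sketch of what the two paths look like with the OpenCV 2.4 gpu module. The demosaic call (gpu::cvtColor with a Bayer code), the Bayer pattern, and the scale factor are placeholders, since the real code isn't shown here:

```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/gpu/gpu.hpp>

// Method 1: ordinary device memory with explicit upload/download.
void run_discrete(const cv::Mat& bayer, cv::Mat& out)
{
    cv::gpu::GpuMat d_bayer, d_color, d_small;
    d_bayer.upload(bayer);                                  // host -> device copy
    cv::gpu::cvtColor(d_bayer, d_color, CV_BayerBG2BGR);    // demosaic (assumed call)
    cv::gpu::resize(d_color, d_small, cv::Size(), 0.5, 0.5);
    d_small.download(out);                                  // device -> host copy
}

// Method 2: zero-copy host memory mapped straight into the GPU address space.
void run_zerocopy(const cv::Mat& bayer, cv::Mat& out)
{
    cv::gpu::CudaMem src(bayer.rows, bayer.cols, bayer.type(),
                         cv::gpu::CudaMem::ALLOC_ZEROCOPY);
    bayer.copyTo(src.createMatHeader());                    // fill the shared buffer
    cv::gpu::GpuMat d_bayer = src.createGpuMatHeader();     // no copy, same memory

    cv::gpu::GpuMat d_color;                                // intermediate in device memory
    cv::gpu::cvtColor(d_bayer, d_color, CV_BayerBG2BGR);

    cv::gpu::CudaMem dst(bayer.rows / 2, bayer.cols / 2, d_color.type(),
                         cv::gpu::CudaMem::ALLOC_ZEROCOPY);
    cv::gpu::GpuMat d_small = dst.createGpuMatHeader();
    cv::gpu::resize(d_color, d_small, d_small.size());

    out = dst.createMatHeader().clone();                    // result already visible to the CPU
}
```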
Did you max out the CPU and GPU clocks before doing the performance tests? By default they are adjusted dynamically based on load, and in short tests they can be basically anything between 100 MHz and 2000 MHz.
NVIDIA is considering posting a more detailed answer, but for now I can give you some hints.
"Zero-copy" on the Tegra K1 (UVM-Lite) makes some CUDA kernels faster and some kernels slower. My guess is that regular color conversions are typically faster with zero-copy, whereas Bayer demosaicing would be slower since it accesses pixels in an irregular pattern.
Zero-copy removes the delay of transferring memory between CPU & GPU, but in Tegra K1 the zero-copy memory won’t be cached as well as regular GPU memory. So zero-copy is more likely to be faster in small simple kernels that don’t access the same group of pixels more than once, while the traditional method is more likely to be faster in large complex kernels that access the same pixels many times.
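To make that concrete, here's an illustrative pair of kernels (not the ones OpenCV actually uses): the first reads each input pixel exactly once, the pattern where mapped zero-copy memory tends to hold up, while the second re-reads each pixel from a 3x3 neighbourhood, which leans on the GPU caches that zero-copy memory doesn't benefit from as much:

```cpp
// Illustrative only.
__global__ void scale_pixel(const unsigned char* in, unsigned char* out,
                            int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = in[y * w + x] / 2;      // one read, one write per pixel
}

__global__ void box3x3(const unsigned char* in, unsigned char* out,
                       int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;

    int sum = 0;
    for (int dy = -1; dy <= 1; ++dy)             // 9 reads per output pixel;
        for (int dx = -1; dx <= 1; ++dx)         // each input pixel is read by ~9 threads
            sum += in[(y + dy) * w + (x + dx)];
    out[y * w + x] = sum / 9;
}
```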
I don’t have much experience with zero-copy myself, but here are some notes that might help tweak the memory performance for Tegra K1.
Using UVM-Lite on Tegra K1 (i.e. allocating memory with cudaMallocManaged() from the UVM-Lite API, to get a memory pointer that works on both the CPU and GPU):
Don't modify the same memory on the CPU and GPU at the same time. One potential strategy is to use two buffers: while the CPU is processing one buffer, the GPU processes the other, so they can run in parallel (see the sketch after these notes).
Launching a CUDA kernel will flush ALL caches used by both the CPU and GPU.
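Here's a rough sketch of that double-buffer idea with managed memory. The buffer names and the dummy kernel are made up, but the key calls are cudaMallocManaged() and cudaStreamAttachMemAsync(), which let the CPU work on one buffer while the GPU works on the other without violating the rule above:

```cpp
#include <cuda_runtime.h>

__global__ void process(unsigned char* buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = 255 - buf[i];               // placeholder GPU work
}

int main()
{
    const int n = 1 << 20;
    unsigned char *bufA, *bufB;
    cudaMallocManaged((void**)&bufA, n);            // one pointer valid on CPU and GPU
    cudaMallocManaged((void**)&bufB, n);

    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaStreamAttachMemAsync(s, bufA, 0, cudaMemAttachSingle); // GPU may touch bufA only via stream s
    cudaStreamAttachMemAsync(s, bufB, 0, cudaMemAttachHost);   // bufB stays CPU-accessible
    cudaStreamSynchronize(s);

    process<<<(n + 255) / 256, 256, 0, s>>>(bufA, n);  // GPU works on bufA ...

    for (int i = 0; i < n; ++i)                        // ... while the CPU fills bufB in parallel
        bufB[i] = (unsigned char)i;

    cudaStreamSynchronize(s);                          // wait before the CPU touches bufA again
    cudaStreamDestroy(s);
    cudaFree(bufA);
    cudaFree(bufB);
    return 0;
}
```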
Try using pinned memory pages instead of regular (pageable) memory, as this is often faster in Tegra K1.
I haven't done a detailed empirical study, but in my experience it's faster to use pinned memory at the input and output of your processing chain, and to allocate all intermediate buffers with cudaMalloc( … ); a sketch follows below.
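A minimal sketch of that layout, with made-up sizes and the kernel calls only indicated in comments:

```cpp
#include <cuda_runtime.h>

void setup_buffers(size_t in_bytes, size_t mid_bytes, size_t out_bytes)
{
    unsigned char *h_in, *h_out;                 // host side, pinned
    unsigned char *d_in, *d_mid, *d_out;         // device side

    cudaHostAlloc((void**)&h_in,  in_bytes,  cudaHostAllocDefault);  // pinned input
    cudaHostAlloc((void**)&h_out, out_bytes, cudaHostAllocDefault);  // pinned output

    cudaMalloc((void**)&d_in,  in_bytes);        // intermediates stay in device memory
    cudaMalloc((void**)&d_mid, mid_bytes);
    cudaMalloc((void**)&d_out, out_bytes);

    // Typical frame loop (kernels omitted):
    //   cudaMemcpyAsync(d_in, h_in, in_bytes, cudaMemcpyHostToDevice, stream);
    //   stage1<<<..., stream>>>(d_in, d_mid, ...);
    //   stage2<<<..., stream>>>(d_mid, d_out, ...);
    //   cudaMemcpyAsync(h_out, d_out, out_bytes, cudaMemcpyDeviceToHost, stream);
    //   cudaStreamSynchronize(stream);

    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    cudaFree(d_in);
    cudaFree(d_mid);
    cudaFree(d_out);
}
```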
So far, managed memory seems to be more about convenience than raw performance. Perhaps that will change, or perhaps it isn't true for certain use cases.