Unexpected OpenCV performance on TK1, and CUDA crash

Hello developers,

I’m writing a simple piece of code that uses the OpenCV 3.2.0 library’s CUDA-optimised functions for some image processing, but the performance on my TK1 is unexpectedly poor. A box filter with a 5x5 kernel on a 4096x1024 image takes 0.017 seconds on the CPU, while the same filter on the GPU takes about 2 seconds. What’s worse, when I loop over 25 images, the GPU process gets killed. I have attached the code below. Could you help me with this problem? I would also be grateful for any documentation on GPU programming you could point me to.

In the code below, SPImage is a class I wrote that generates images containing random salt-and-pepper noise. The full code can be found at https://github.com/sthy14/TK1CV

#include "SPImage.hpp"
#include "omp.h"
#include <vector>
#include <opencv2/opencv.hpp>
#include <opencv2/core/cuda.hpp>      // cuda::GpuMat, cuda::DeviceInfo
#include <opencv2/cudafilters.hpp>    // cuda::createBoxFilter, cuda::Filter
#include <string>
#include <iostream>
#include <fstream>

#define NUMBER 1

using namespace cv;
using namespace std;

int main(int argc, char* argv[]) {
  cuda::DeviceInfo gpu;
  if (!gpu.isCompatible()) {
    cout << "GPU is not compatible\n";
    exit(-1);
  }
  SPImage generator;
  Mat image[25];
  for (int i = 0; i < NUMBER; i++) {
    image[i] = generator.Generate(4096, 1024);
    string fileName = "imgO" + to_string(i) + ".bmp";
    imwrite(fileName, image[i]);
  }
  cuda::GpuMat gImage[25];
  for (int i = 0; i < NUMBER; i++) {
    gImage[i].upload(image[i]);
  }
  Ptr<cuda::Filter> gpuBlur = cuda::createBoxFilter(CV_8UC1, CV_8UC1, Size(5, 5));
  Mat imageOut[25];
  cuda::GpuMat gOut[25];
  Mat gImageOut[25];
  Mat mpImageOut[25];

  // Single-threaded CPU box filter.
  double sStart = (double) getTickCount();
  for (int i = 0; i < NUMBER; i++) {
    blur(image[i], imageOut[i], Size(5, 5));
  }
  double sTime = ((double) getTickCount() - sStart) / getTickFrequency();
  cout << sTime << "\n";

  // OpenMP-parallel CPU box filter.
  double mpStart = (double) getTickCount();
#pragma omp parallel for
  for (int i = 0; i < NUMBER; i++) {
    blur(image[i], mpImageOut[i], Size(5, 5));
  }
  double mpTime = ((double) getTickCount() - mpStart) / getTickFrequency();
  cout << mpTime << "\n";

  // CUDA box filter on the previously uploaded device buffers.
  double gStart = (double) getTickCount();
  for (int i = 0; i < NUMBER; i++) {
    gpuBlur->apply(gImage[i], gOut[i]);
  }
  double gTime = ((double) getTickCount() - gStart) / getTickFrequency();
  cout << gTime << "\n";

  // Download the GPU results; only the first NUMBER outputs exist.
  for (int i = 0; i < NUMBER; i++) {
    gOut[i].download(gImageOut[i]);
  }

  for (int i = 0; i < NUMBER; i++) {
    // Output processed images.
    string sName = "imgS" + to_string(i) + ".bmp";
    string gName = "imgG" + to_string(i) + ".bmp";
    string mpName = "imgMP" + to_string(i) + ".bmp";
    imwrite(sName, imageOut[i]);
    imwrite(gName, gImageOut[i]);
    imwrite(mpName, mpImageOut[i]);
  }
}

Kind regards,

Dawen

Hi,

Thanks for your question.
Could you maximize the CPU/GPU frequency and try it again? You can check the current clocks and utilization with:

sudo ./tegrastats

Thanks.

Hi,

Thank you for your reply. I followed the instructions at http://elinux.org/Jetson/Performance to maximize the clocks. The GPU performance is slightly better: the blurring time decreased from 2 s to 1 s. However, this is still far slower than the CPU version (0.015 s). Are there any other tuning steps I could try, or any guides I could follow?

Thanks.

Could you make a warm-up call to gpuBlur before your GPU measurement loop?

gpuBlur->apply(gImage[0], gOut[0]);
double gStart = (double) getTickCount();
...

I used to experience a long delay on the first call. It seems to take time to load/initialise everything (CUDA, OpenCV, …). This setup time is also variable, which makes latency measurements non-repeatable.

If someone knows a clean way to get everything ready, please share.
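For reference, here is a rough, untested sketch of one way to separate the one-time setup cost from the steady-state filter time. It assumes OpenCV 3.2 built with the CUDA modules, and uses a random matrix in place of your SPImage generator:

#include <opencv2/opencv.hpp>
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudafilters.hpp>
#include <iostream>

using namespace cv;
using namespace std;

int main() {
  cuda::setDevice(0);  // touch the CUDA device early so context creation is not timed

  // Stand-in input: random 8-bit image of the same size as yours.
  Mat image(1024, 4096, CV_8UC1);
  randu(image, Scalar(0), Scalar(255));

  cuda::GpuMat gIn, gOut;
  gIn.upload(image);

  Ptr<cuda::Filter> gpuBlur = cuda::createBoxFilter(CV_8UC1, CV_8UC1, Size(5, 5));

  // Warm-up: the first apply() pays for kernel/module initialisation.
  gpuBlur->apply(gIn, gOut);

  // Timed region: 25 filter calls, then a download to force completion
  // before the clock is stopped.
  Mat result;
  double start = (double) getTickCount();
  for (int i = 0; i < 25; i++) {
    gpuBlur->apply(gIn, gOut);
  }
  gOut.download(result);
  double elapsed = ((double) getTickCount() - start) / getTickFrequency();
  cout << "GPU box filter, 25 frames after warm-up: " << elapsed << " s\n";

  return 0;
}

In my experience the numbers only become meaningful after that first warm-up call, for the reason above.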

[EDIT: you may also have a look at http://stackoverflow.com/questions/31523216/opencv-gpu-blurring-is-slow ]