Parallelizing pixel value assignment?

I am trying to copy pixel values from one image to another. The operation takes a long time when done with nested for loops. Any idea how to do the same operation using CUDA kernels and OpenCV matrices? Here is the piece of code that I'm trying to parallelize:

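// Unwrap the circular region around the image centre into a rectangular strip.
const int radius = frame.rows / 2;
const double arcLength = 2.0 * CV_PI * radius;
Mat dst(radius, (int)arcLength, CV_8UC3, Scalar::all(0));
const Point cent(frame.cols / 2, frame.rows / 2);

// Integer loop counters bounded by dst's real size: a double x could reach
// floor(arcLength), which is one column past the end of dst.
for (int y = 0; y < dst.rows; y++) {
   for (int x = 0; x < dst.cols; x++) {
      const double angle = 2.0 * CV_PI * (x / arcLength);
      // Source pixel at distance y from the centre, in direction `angle`.
      int xT = (int)round(cent.x + y * cos(angle));
      int yT = (int)round(cent.y + y * sin(angle));
      dst.at<Vec3b>(Point(x, radius - y - 1)) = frame.at<Vec3b>(Point(xT, yT));
   }
}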

I would like to move this block of code to the device and perform the calculations there in order to achieve a speedup. Any help would be appreciated. Thanks.

Unless your data is already resident on the GPU, and the images are sufficiently large to make good use of the GPU’s massive parallelism, simply porting to the GPU may not do much good.

It seems unlikely that you need double computation. Consider using float instead.

You are computing sin() and cos() of the same angle. Use sincos() instead. Better yet, since the argument has the form PI * <expression>, use sincospi(), which computes the sine and cosine of PI times its argument.

Avoid expensive divisions in the innermost loop. Precompute the reciprocal of arcLength once and multiply inside the loop.
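The three suggestions above (float instead of double, sine and cosine evaluated together, and a reciprocal computed once) can be sketched on the CPU roughly as follows. This is only a sketch under stated assumptions, not the original poster's code: `Image` and `unwrapPolar` are hypothetical stand-ins for cv::Mat and the loop in the question, chosen so the example has no OpenCV dependency. Standard C++ has no sincos(), so std::sin and std::cos are called back to back on the same argument; in CUDA device code a single sincospif() call would take their place. Since the angle depends only on x, the sketch also evaluates the trig functions once per column instead of once per pixel, and it skips source coordinates that fall outside the frame, a guard the original does not have.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a 3-channel, 8-bit cv::Mat (row-major, interleaved).
struct Image {
    int rows, cols;
    std::vector<std::uint8_t> data;  // rows * cols * 3 bytes, zero-initialised
    Image(int r, int c)
        : rows(r), cols(c), data(static_cast<std::size_t>(r) * c * 3, 0) {}
    std::uint8_t* px(int x, int y) {
        return &data[(static_cast<std::size_t>(y) * cols + x) * 3];
    }
    const std::uint8_t* px(int x, int y) const {
        return &data[(static_cast<std::size_t>(y) * cols + x) * 3];
    }
};

// Hypothetical float version of the loop from the question.
Image unwrapPolar(const Image& frame) {
    const int radius = frame.rows / 2;
    if (radius == 0) return Image(0, 0);

    const float twoPi = 6.2831853f;
    const float arcLength = twoPi * static_cast<float>(radius);
    Image dst(radius, static_cast<int>(arcLength));
    const int cx = frame.cols / 2;
    const int cy = frame.rows / 2;

    // One division in total: angle(x) = x * anglePerColumn.
    const float anglePerColumn = twoPi / arcLength;

    // The angle depends only on x, so evaluate sin/cos once per column.
    std::vector<float> cosTab(dst.cols), sinTab(dst.cols);
    for (int x = 0; x < dst.cols; ++x) {
        const float angle = static_cast<float>(x) * anglePerColumn;
        cosTab[x] = std::cos(angle);
        sinTab[x] = std::sin(angle);
    }

    for (int y = 0; y < radius; ++y) {
        const int outRow = radius - y - 1;
        for (int x = 0; x < dst.cols; ++x) {
            const int xT = static_cast<int>(std::lround(cx + y * cosTab[x]));
            const int yT = static_cast<int>(std::lround(cy + y * sinTab[x]));
            // Skip reads outside the frame (possible when cols < rows).
            if (xT < 0 || xT >= frame.cols || yT < 0 || yT >= frame.rows) continue;
            const std::uint8_t* src = frame.px(xT, yT);
            std::uint8_t* out = dst.px(x, outRow);
            out[0] = src[0];
            out[1] = src[1];
            out[2] = src[2];
        }
    }
    return dst;
}
```

The same per-pixel body maps naturally onto one GPU thread per destination pixel, but whether that pays off depends on the points raised above about data residency and image size.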

Do you need to use the round() function? I am guessing the answer is “yes”, to prevent artifacts caused by round-to-nearest-or-even (4.5 → 4, 5.5 → 6). If that’s not the case, try rint() instead.
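To make the difference concrete, here is a small sketch comparing the two functions on halfway cases. The helper names are hypothetical, and the rint() behaviour shown assumes the default floating-point rounding mode (round to nearest, ties to even) has not been changed.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: std::round() rounds halfway cases away from zero.
inline int roundHalfAway(double v) { return static_cast<int>(std::round(v)); }

// Hypothetical helper: std::rint() follows the current rounding mode,
// which by default rounds halfway cases to the nearest even integer.
inline int roundHalfEven(double v) { return static_cast<int>(std::rint(v)); }
```

With these helpers, roundHalfAway(4.5) gives 5 while roundHalfEven(4.5) gives 4; both give 6 for 5.5. The two only differ on exact halfway values, so whether the distinction is visible in the output image depends on how often such values occur.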

Have you profiled this code? Despite the trig function computation and divisions in the innermost loop, it may be limited by memory bandwidth.
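Short of a full profiler, a rough first check is to time the loop and convert the result into an effective bandwidth figure that can be compared against what the machine's memory can deliver. The helpers below are a hypothetical sketch: `secondsFor` and `effectiveGBps` are names invented here, and the 6 bytes per pixel (3 read, 3 written) is a simplifying assumption that ignores caching and the scattered access pattern of the reads.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: wall-clock time, in seconds, of one call to work().
template <class F>
double secondsFor(F&& work) {
    const auto t0 = std::chrono::steady_clock::now();
    work();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Hypothetical helper: effective bandwidth in GB/s, assuming each
// destination pixel costs 3 bytes read plus 3 bytes written.
inline double effectiveGBps(std::size_t pixels, double seconds) {
    assert(seconds > 0.0);
    const double bytes = static_cast<double>(pixels) * 6.0;
    return bytes / seconds / 1e9;
}
```

If the measured figure is already close to the memory bandwidth of the machine, trimming the arithmetic in the inner loop is unlikely to help much; if it is far below, the trig functions and divisions are the more likely bottleneck.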

Are you compiling the CPU code with maximum optimization and vectorization enabled? The code structure seems simple enough that autovectorization should apply.

@njuffa how do I compile with maximum optimization and vectorization enabled?

The documentation of your tool chain should tell you that.

@njuffa sorry if it's trivial, but what does tool chain mean in this context?

Your compiler with accompanying libraries and utilities.