# Parallelizing pixel value assignment?

I am trying to copy pixel values from one image to another. The operation takes a long time when I iterate with nested for loops. Is there a way to do the same operation using CUDA kernels and OpenCV matrices? Here is the piece of code that I'm trying to parallelize:

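``````
// Assumes frame and dst are 3-channel 8-bit cv::Mat objects and that dst is
// already allocated large enough to hold radius rows and arcLength columns.
double radius = frame.rows / 2;
double arcLength = 2.0 * CV_PI * radius;
cv::Point cent(frame.cols / 2, frame.rows / 2);

for (int y = 0; y < radius; y++) {
    for (int x = 0; x < arcLength; x++) {
        // Angle along the circle that corresponds to destination column x
        double angle = 2 * CV_PI * (x / arcLength);
        int xT = (int)round(cent.x + y * cos(angle));
        int yT = (int)round(cent.y + y * sin(angle));
        // The center of the source lands in the bottom row of dst
        dst.at<cv::Vec3b>(cv::Point(x, (int)radius - y - 1)) = frame.at<cv::Vec3b>(cv::Point(xT, yT));
    }
}
``````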

I would like to move this block of code to the GPU and perform the calculations in device memory in order to achieve a speedup. Any help would be appreciated. Thanks.

Unless your data is already resident on the GPU, and the images are sufficiently large to make good use of the GPU’s massive parallelism, simply porting to the GPU may not do much good.

It seems unlikely that you need `double` computation. Consider using `float` instead.

You are computing `sin()` and `cos()` of the same angle. Use `sincos()` instead. Better yet, since the code computes `sin (PI * <expression>)`, use `sincospi()`.

Avoid expensive divisions in the innermost loop. Precompute the reciprocal of `arcLength` once and multiply inside the loop.
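To make the last three suggestions concrete, here is a minimal host-side C++ sketch of the coordinate mapping, assuming `float` precision is enough for your image sizes. The helper name `polarToSource` and the parameter `invArcLength` are hypothetical and not part of your code or of OpenCV. Standard C++ has no combined sine/cosine call, so the sketch calls `std::cos` and `std::sin` separately; in CUDA device code that pair is where a single `sincospif()` call could go.

```cpp
#include <cmath>
#include <utility>

// Hypothetical helper: maps destination column x and radial distance y to
// source pixel coordinates (xT, yT).
// invArcLength is assumed to be 1.0f / arcLength, computed once by the caller.
inline std::pair<int, int> polarToSource(int x, int y, int centX, int centY,
                                         float invArcLength) {
    const float kTwoPi = 6.28318530717958647692f;
    // Multiply by the precomputed reciprocal instead of dividing per pixel.
    const float angle = kTwoPi * (static_cast<float>(x) * invArcLength);
    // Both trig values come from the same angle.
    const float c = std::cos(angle);
    const float s = std::sin(angle);
    // std::round keeps the rounding behavior of the original loop.
    const int xT = static_cast<int>(std::round(centX + y * c));
    const int yT = static_cast<int>(std::round(centY + y * s));
    return {xT, yT};
}
```

In the loop you would compute `const float invArcLength = 1.0f / arcLength;` once before the outer `for` and call the helper per pixel. Whether this is measurably faster depends on your compiler and on whether the loop is compute bound at all.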

Do you need to use the `round()` function? I am guessing the answer is “yes”, to prevent artifacts caused by round-to-nearest-or-even (4.5 -> 4, 5.5 -> 6). If that’s not the case, try `rint()` instead.
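A small C++ sketch of the difference, assuming the default floating-point rounding mode (round to nearest, ties to even) and no fast-math compiler options. The two wrapper names are made up for illustration.

```cpp
#include <cmath>

// std::round rounds halfway cases away from zero: 4.5 -> 5, 5.5 -> 6.
inline int roundHalfAway(double v) {
    return static_cast<int>(std::round(v));
}

// std::rint uses the current rounding mode. In the default mode, halfway
// cases go to the nearest even integer: 4.5 -> 4, 5.5 -> 6.
inline int roundCurrentMode(double v) {
    return static_cast<int>(std::rint(v));
}
```

The two only disagree on exact halfway cases, so whether the difference is visible depends on how often your computed coordinates land exactly on a .5 value.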

Have you profiled this code? Despite the trig function computation and divisions in the innermost loop, it may be limited by memory bandwidth.
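If you have no profiler at hand, a wall-clock timer around the loop is a reasonable first step. It is not a substitute for a real profiler, since it will not tell you whether the time goes to arithmetic or to memory traffic. The helper name `elapsedMs` below is hypothetical.

```cpp
#include <chrono>

// Hypothetical helper: runs the callable once and returns the elapsed
// wall-clock time in milliseconds.
template <typename F>
double elapsedMs(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

You could wrap the nested loops in a lambda, pass it to `elapsedMs`, and compare the timings before and after each change. Run it several times and on realistic image sizes, since a single measurement can be noisy.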

Are you compiling the CPU code with maximum optimization and vectorization enabled? The code structure seems simple enough that autovectorization should apply.

@njuffa How do I compile with maximum optimization and vectorization enabled?

The documentation of your tool chain should tell you that.

@njuffa Sorry if this is trivial, but what does "tool chain" mean in this context?

Your compiler with accompanying libraries and utilities.