I am copying pixel values from one image to another, and doing this with a nested for loop is slow. Is there a way to do the same operation with a CUDA kernel on an OpenCV matrix? Here is the code I'm trying to parallelize:

```
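// Unwrap the circular region around the centre of `frame` into a rectangular strip `dst`.
const int radius = frame.rows / 2;
const double arcLength = 2.0 * CV_PI * radius;
const int width = (int)arcLength;  // dst.cols; looping on the double bound runs one column past it
Mat dst(radius, width, CV_8UC3, Scalar::all(0));
const cv::Point cent(frame.cols / 2, frame.rows / 2);
for (int y = 0; y < radius; y++) {
    for (int x = 0; x < width; x++) {
        // Angle of this output column; y is the distance from the centre.
        const double theta = 2.0 * CV_PI * (x / arcLength);
        // Source pixel; stays inside `frame` as long as frame.cols >= frame.rows.
        const int xT = (int)round(cent.x + y * cos(theta));
        const int yT = (int)round(cent.y + y * sin(theta));
        dst.at<Vec3b>(Point(x, radius - y - 1)) = frame.at<Vec3b>(Point(xT, yT));
    }
}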
```
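
As far as I can tell, each destination pixel depends only on its own loop indices, so the loop body could in principle become the work of a single GPU thread. Below is a minimal sketch of that per-pixel mapping as a standalone function in plain C++ (no OpenCV or CUDA). The names `SrcCoord`, `sourceFor` and `kPi` are hypothetical, made up for illustration. I assume that in a kernel `x` and `y` would come from the thread and block indices instead of loop counters.

```cpp
#include <cmath>

// Hypothetical constant standing in for CV_PI so the sketch needs no OpenCV.
const double kPi = 3.14159265358979323846;

// Hypothetical helper type: the source pixel that feeds one destination pixel.
struct SrcCoord {
    int x;
    int y;
};

// Per-pixel mapping taken from the loop body above.
// x is the column of the unwrapped strip, y is the distance from the centre.
// The value read at the returned coordinate goes to destination row
// radius - y - 1, column x.
SrcCoord sourceFor(int x, int y, int centX, int centY, double arcLength) {
    const double theta = 2.0 * kPi * (x / arcLength);
    SrcCoord s;
    s.x = (int)std::round(centX + y * std::cos(theta));
    s.y = (int)std::round(centY + y * std::sin(theta));
    return s;
}
```

For example, with the centre at (320, 240), `sourceFor(0, 100, 320, 240, arcLength)` gives the source pixel (420, 240), since column 0 corresponds to an angle of zero. Because the function reads no shared state, every (x, y) pair can be evaluated independently.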

I would like to move this computation to the device (GPU) and run it there to get a speed-up. Any help would be appreciated. Thanks.