The time to put image data into CUDA memory increases a lot when the image becomes larger

I use TensorRT to run inference on the TX2 platform; the JetPack version is 3.3. When the input image size is 400×225, everything is fine: putting the image data into CUDA memory takes 1 ms. However, when the image size is 500×288, the same step takes about 10 ms.

The code below is how I put the image data into CUDA memory. mInputCPU is page-locked host memory that is mapped into the GPU's address space with cudaHostAlloc() (a simplified sketch of that allocation follows the loop).

for( uint32_t y=0; y < INPUT_H; y++ )
{
    for( uint32_t x=0; x < INPUT_W; x++ )
    {
        // dstImage is a BGR cv::Mat; write it out as planar R, G, B floats
        cv::Vec3b intensity = dstImage.at<cv::Vec3b>(y, x);

        mInputCPU[imgPixels * 0 + y * INPUT_W + x] = float(intensity.val[2]);  // R
        mInputCPU[imgPixels * 1 + y * INPUT_W + x] = float(intensity.val[1]);  // G
        mInputCPU[imgPixels * 2 + y * INPUT_W + x] = float(intensity.val[0]);  // B
    }
}
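
For reference, the allocation of mInputCPU looks roughly like this (a simplified sketch, not my exact code; error checking is trimmed and the names mInputGPU, allocInput and imgPixels are just illustrative):

#include <cuda_runtime.h>

float* mInputCPU = NULL;   // host-side pointer (filled by the loop above)
float* mInputGPU = NULL;   // device-side alias of the same memory

bool allocInput( size_t imgPixels )
{
    // Allow page-locked host memory to be mapped into the device address space.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // 3 planes (R, G, B) of float per pixel.
    const size_t bytes = imgPixels * 3 * sizeof(float);

    // Zero-copy allocation: the CPU writes into mInputCPU and the GPU reads
    // the same memory through mInputGPU, so no explicit cudaMemcpy is needed.
    if( cudaHostAlloc((void**)&mInputCPU, bytes, cudaHostAllocMapped) != cudaSuccess )
        return false;

    if( cudaHostGetDevicePointer((void**)&mInputGPU, mInputCPU, 0) != cudaSuccess )
        return false;

    return true;
}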

Thanks

Is this code inside the kernel function?
Can you swap the for() lines from:

for( uint32_t y=0; y < INPUT_H; y++)
    for( uint32_t x=0; x < INPUT_W; x++)

to

for( uint32_t x=0; x < INPUT_W; x++)
    for( uint32_t y=0; y < INPUT_H; y++)

and see if it makes any difference for both of the image sizes you mentioned?

I am not familiar with OpenCV, but just from a quick look, I’d like to ask if the assignment

cv::Vec3b intensity = dstImage.at<cv::Vec3b>(y, x);

should be

cv::Vec3b intensity = dstImage.at<cv::Vec3b>(x, y);

My first guess would be that you are looping against the CPU cache. Whether that alone explains a consistent jump from 1 ms to 10 ms with such small images is another story.
Maybe you want to have a look at: https://stackoverflow.com/questions/21950786/taking-r-g-b-individually-from-a-cvvec3b-vector
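
Something along these lines might also be worth trying (an untested sketch, assuming dstImage is a CV_8UC3 cv::Mat of size INPUT_W × INPUT_H and mInputCPU is laid out as planar R, G, B). It fetches one row pointer per scanline instead of calling at<>() for every pixel:

const uint32_t imgPixels = INPUT_W * INPUT_H;

for( uint32_t y=0; y < INPUT_H; y++ )
{
    // Row pointer fetched once per scanline; avoids the per-pixel at<>() overhead.
    const cv::Vec3b* row = dstImage.ptr<cv::Vec3b>(y);

    for( uint32_t x=0; x < INPUT_W; x++ )
    {
        mInputCPU[imgPixels * 0 + y * INPUT_W + x] = float(row[x][2]);  // R
        mInputCPU[imgPixels * 1 + y * INPUT_W + x] = float(row[x][1]);  // G
        mInputCPU[imgPixels * 2 + y * INPUT_W + x] = float(row[x][0]);  // B
    }
}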