You'll probably need to provide a complete code for anyone to diagnose what is going on. Your kernel should take on the order of 800 microseconds, not 800 milliseconds. Here is a fully worked test case around the kernel you have shown:
$ cat t297.cu
#include <cstdio>
__global__ void applyROI(unsigned char *inputImage, int width, int height, int components, int step_input, unsigned char *outputImage, int step_output, unsigned char *roi, int step_roi) {
  if (threadIdx.x + (blockDim.x * blockIdx.x) >= width) return;
  int global_idx = threadIdx.x + (blockDim.x * blockIdx.x);
  int global_idy = threadIdx.y + (blockDim.y * blockIdx.y);
  int roiIndex    = global_idx + (step_roi * global_idy);
  int inputIndex  = global_idx + (step_input / 4 * global_idy);
  int outputIndex = global_idx + (step_output / 4 * global_idy);
  if (roi[roiIndex] != 0) {
    int *out = (int*)outputImage;
    int *in  = (int*)inputImage;
    out[outputIndex] = in[inputIndex];
  } else {
    int *out = (int*)outputImage;
    out[outputIndex] = 0;
  }
}

int main(){
  int x = 3640;
  int y = 2160;
  int nc = 4;
  unsigned char *idata, *odata, *roi;
  cudaMalloc(&idata, x*y*nc);
  cudaMalloc(&odata, x*y*nc);
  cudaMalloc(&roi, x*y);
  cudaMemset(idata, 8, x*y*nc);
  cudaMemset(odata, 0, x*y*nc);
  cudaMemset(roi, 1, x*y);
  dim3 block(16,16);
  dim3 grid((x+block.x-1)/block.x, y/block.y);
  applyROI<<<grid, block>>>(idata, x, y, nc, nc*x, odata, nc*x, roi, x);
  unsigned char *result = new unsigned char[x*y*nc];
  cudaMemcpy(result, odata, x*y*nc, cudaMemcpyDeviceToHost);
  for (int i = 0; i < x*y*nc; i++)
    if (result[i] != 8){printf("mismatch at %d, was %d, should be %d\n", i, (int)(result[i]), 8); return -1;}
  return 0;
}
$ nvcc t297.cu -o t297
$ cuda-memcheck ./t297
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$ nvprof ./t297
==14027== NVPROF is profiling process 14027, command: ./t297
==14027== Profiling application: ./t297
==14027== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   98.97%  17.077ms         1  17.077ms  17.077ms  17.077ms  [CUDA memcpy DtoH]
                    1.00%  173.25us         1  173.25us  173.25us  173.25us  applyROI(unsigned char*, int, int, int, int, unsigned char*, int, unsigned char*, int)
                    0.02%  3.6800us         3  1.2260us  1.0560us  1.5680us  [CUDA memset]
      API calls:   93.05%  342.57ms         3  114.19ms  354.85us  341.80ms  cudaMalloc
                    4.83%  17.793ms         1  17.793ms  17.793ms  17.793ms  cudaMemcpy
                    1.41%  5.2034ms       384  13.550us     349ns  574.74us  cuDeviceGetAttribute
                    0.51%  1.8905ms         4  472.64us  279.29us  750.10us  cuDeviceTotalMem
                    0.13%  466.85us         4  116.71us  105.22us  138.24us  cuDeviceGetName
                    0.04%  152.62us         3  50.872us  28.787us  88.425us  cudaMemset
                    0.01%  33.038us         1  33.038us  33.038us  33.038us  cudaLaunchKernel
                    0.01%  24.609us         4  6.1520us  4.1290us  9.2150us  cuDeviceGetPCIBusId
                    0.00%  9.4270us         8  1.1780us     497ns  2.4920us  cuDeviceGet
                    0.00%  6.3100us         3  2.1030us     343ns  3.6770us  cuDeviceGetCount
                    0.00%  2.6990us         4     674ns     633ns     713ns  cuDeviceGetUuid
$
The profiler indicates that the kernel duration is less than 200us on my Tesla P100. Your Quadro P4000 will be slower than that, but it should not be more than about 4x slower. (If, when you wrote 800ms, you actually meant 800 microseconds, then your results may be plausible.)
Perhaps you should try running and profiling my complete code on your system. If you still need help after that, provide a complete code as I have done, along with the complete command line you are using to compile it and the profiler output.
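If you'd rather not use a profiler, you can also time the kernel directly with CUDA events. Here is a minimal sketch (with a placeholder kernel; substitute your own launch where indicated); note that cudaEventElapsedTime reports milliseconds, so a mixed-up unit here could explain an "800ms" reading:

```cuda
#include <cstdio>

__global__ void dummy(int *p) { p[threadIdx.x] = threadIdx.x; }

int main() {
    int *d;
    cudaMalloc(&d, 32 * sizeof(int));
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    dummy<<<1, 32>>>(d);          // substitute your applyROI launch here
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait for the kernel to finish
    float ms = 0;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in MILLIseconds
    printf("kernel time: %.3f ms\n", ms);
    return 0;
}
```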