How to convert a for loop with an increment greater than one into CUDA

I am trying to extract overlapping patches from an image in parallel, where consecutive patches are offset by a pixel shift. I have written a CPU version of the code, but I have not been able to convert the for loops that increment by the pixel shift. The part of the code that uses these loops is shown below, and the full CPU version is attached. Please help me convert this code to CUDA. Reconstruction_PE.cpp (4.4 KB)

int num_of_patch_image = num_of_patch_row * num_of_patch_col;
for (int i = 0; i < height; i += pixel_shift) {
    int counter_col = 0;
    for (int j = 0; j < width; j += pixel_shift) {
        for (int ii = 0; ii < patch_size; ii++) {
            for (int jj = 0; jj < patch_size; jj++) {
                if ((i + ii) < height && (j + jj) < width) {
                    patch_data[num_of_patch_image * (patch_size * ii + jj) + num_of_patch_col * counter_row + counter_col] = (double)original_data[width * (i + ii) + (j + jj)];
                }
                else {
                    patch_data[num_of_patch_image * (patch_size * ii + jj) + num_of_patch_col * counter_row + counter_col] = 0.;
                }
            }
        }

        counter_col++;
        if (counter_col == num_of_patch_col) {
            break;
        }
    }
    counter_row++;
    if (counter_row == num_of_patch_row) {
        break;
    }
}

}

Why wouldn’t you be able to convert this to CUDA? Could you be more specific?
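
For what it's worth, the `+= pixel_shift` stride does not have to survive the conversion. Below is a minimal sketch, assuming one thread per output element: each thread recovers its patch row and column from its global index and multiplies them by pixel_shift to get the top-left corner of its patch, which takes the place of i and j in the CPU loops. The kernel name extract_patches, the host wrapper extract_patches_gpu, the float type for original_data, and the 256-thread block size are assumptions made for illustration; none of them come from the posted code.

```cuda
#include <cstddef>
#include <stdexcept>
#include <vector>
#include <cuda_runtime.h>

// One thread per element of patch_data. The strided loop variables are
// rebuilt from the patch index: i = patch_row * pixel_shift and
// j = patch_col * pixel_shift.
// Assumption: original_data holds float; the posted snippet does not show its type.
__global__ void extract_patches(const float* original_data, double* patch_data,
                                int height, int width, int patch_size,
                                int pixel_shift, int num_of_patch_row,
                                int num_of_patch_col)
{
    const size_t num_of_patch_image = (size_t)num_of_patch_row * num_of_patch_col;
    const size_t total = num_of_patch_image * patch_size * patch_size;
    const size_t t = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= total) return;  // the last block may have spare threads

    // Same layout as the CPU code:
    // t = num_of_patch_image * (patch_size * ii + jj) + num_of_patch_col * row + col
    const int patch = (int)(t % num_of_patch_image);  // num_of_patch_col * row + col
    const int elem  = (int)(t / num_of_patch_image);  // patch_size * ii + jj
    const int y = (patch / num_of_patch_col) * pixel_shift + elem / patch_size;  // i + ii
    const int x = (patch % num_of_patch_col) * pixel_shift + elem % patch_size;  // j + jj

    // Zero-pad wherever the patch hangs over the image border.
    patch_data[t] = (y < height && x < width)
                        ? (double)original_data[(size_t)width * y + x]
                        : 0.0;
}

// Hypothetical helper: turn a CUDA error into an exception.
// Device memory is not released on the error path in this sketch.
static void check(cudaError_t err)
{
    if (err != cudaSuccess) throw std::runtime_error(cudaGetErrorString(err));
}

// Hypothetical host wrapper: copy the image to the device, launch the kernel,
// and return the patch matrix.
std::vector<double> extract_patches_gpu(const std::vector<float>& image,
                                        int height, int width, int patch_size,
                                        int pixel_shift, int num_of_patch_row,
                                        int num_of_patch_col)
{
    const size_t total = (size_t)num_of_patch_row * num_of_patch_col
                       * patch_size * patch_size;
    std::vector<double> patches(total);
    if (total == 0) return patches;

    float* d_image = nullptr;
    double* d_patches = nullptr;
    check(cudaMalloc(&d_image, image.size() * sizeof(float)));
    check(cudaMalloc(&d_patches, total * sizeof(double)));
    check(cudaMemcpy(d_image, image.data(), image.size() * sizeof(float),
                     cudaMemcpyHostToDevice));

    const int threads = 256;  // assumed block size, not tuned
    const unsigned int blocks = (unsigned int)((total + threads - 1) / threads);
    extract_patches<<<blocks, threads>>>(d_image, d_patches, height, width,
                                         patch_size, pixel_shift,
                                         num_of_patch_row, num_of_patch_col);
    check(cudaGetLastError());  // reports launch-configuration errors
    check(cudaMemcpy(patches.data(), d_patches, total * sizeof(double),
                     cudaMemcpyDeviceToHost));  // also waits for the kernel

    check(cudaFree(d_image));
    check(cudaFree(d_patches));
    return patches;
}
```

Because consecutive threads write consecutive elements of patch_data, the stores should coalesce. One behavioral difference from the CPU loops: if num_of_patch_row or num_of_patch_col is larger than the number of shifts that fit in the image, this sketch fills the extra patches with zeros, whereas the CPU code exits the loop and never writes them.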

related:

https://stackoverflow.com/questions/65261824/while-launched-the-kernel-error-checking-api-showing-unspecified-launch-failure