CUDA Crop float4* image

Hi,

I am trying to crop a region of interest (ROI) out of a 1920x1080 image and pass that ROI through a neural net, as in https://github.com/dusty-nv/jetson-inference/blob/master/detectnet-camera/detectnet-camera.cpp. The ROI should come from the imgRGBA image. We found a crop function in https://github.com/dusty-nv/jetson-video/blob/master/cuda/cudaCrop.cu and tried to modify it to crop only an ROI, like this:

template  <typename T>
__global__ void gpuCrop( T* input, int inputWidth, T* output, int outputWidth, int outputHeight, int xMin, int yMin )
{
	const int y = blockDim.y * blockIdx.y + threadIdx.y;
	const int x = blockDim.x * blockIdx.x + threadIdx.x;
	
	if( x >= outputWidth + xMin || y >= outputHeight + yMin || x < xMin || y < yMin)
		return;

	const T px = input[y * inputWidth + x];

	output[(y-yMin) * outputWidth + (x-xMin)] = px;
} 


cudaError_t cudaCrop( float4* input, const dim3& inputSize, float4* output, const dim3& outputSize, int xMin, int yMin )
{
	if( !input || !output )
		return cudaErrorInvalidDevicePointer;

	if( inputSize.x == 0 || inputSize.y == 0 || inputSize.z == 0 || outputSize.x == 0 || outputSize.y == 0 || outputSize.z == 0 ) 
		return cudaErrorInvalidValue;

	//const int inputAlignedWidth  = inputSize.z  / sizeof(uint8_t); - we commented these out, what do they do?
	//const int outputAlignedWidth = outputSize.z / sizeof(uint8_t);
	

	// launch kernel
	const dim3 blockDim(8, 8);
	const dim3 gridDim(iDivUp(outputSize.x,blockDim.x), iDivUp(outputSize.y,blockDim.y));

	gpuCrop<float4><<<gridDim, blockDim>>>(input, inputSize.x, output, outputSize.x, outputSize.y, xMin, yMin);

	return CUDA(cudaGetLastError());
}

However, when we run this, we get an image where the silhouettes of certain items in the frame are vaguely recognizable but everything is bright yellow or white. I’m also not entirely sure that it’s actually doing the cropping. What am I doing wrong?

Thank you so much!

Some things don’t look right to me, although it’s hard to be sure without a more complete test case:

Let’s suppose you are going to crop a 100x100 ROI in your image, and that ROI has an anchor point at 200,200 in the input image.

Presumably, then, outputWidth = outputHeight = 100 and xMin = yMin = 200.

Your code would then launch a grid of about 100x100 threads (with 8x8 blocks, iDivUp rounds this up to 13x13 blocks, i.e. 104x104 threads). The expectation is that each thread is responsible for one pixel in the ROI and moves that pixel to the appropriate location in the output image.

In the kernel, the variables x and y give each thread’s position in that grid of threads, not a position in the input image. They start at 0, and the threads that matter for the ROI have indices 0…99.

With that setup, does your thread check make sense?

if( x >= outputWidth + xMin || y >= outputHeight + yMin || x < xMin || y < yMin)
		return;

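Do the math. Do any threads survive the return statement? With x and y no larger than about 100 and xMin = yMin = 200, the tests x < xMin and y < yMin are true for every thread, so they all return and the output buffer is never written. That would leave whatever happened to be in that memory, which may be the odd image you are seeing.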
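To make the arithmetic concrete, here is a small host-side sketch (plain C++, no GPU needed) that replays the guard from the question over the thread indices the launch would generate. The names `divUp` and `survivorsOriginalGuard` are made up for illustration. `divUp` is assumed to round up the way `iDivUp` does, and the 8x8 block size is taken from the posted launch code.

```cpp
#include <cassert>

// Ceiling division; assumed to match what iDivUp does in the posted launch code.
int divUp(int a, int b) { return (a + b - 1) / b; }

// Counts the threads that get past the guard from the question,
// for a grid sized from the OUTPUT dimensions (as in the posted cudaCrop).
int survivorsOriginalGuard(int outW, int outH, int xMin, int yMin, int block = 8)
{
    assert(block > 0);
    const int threadsX = divUp(outW, block) * block;  // 104 when outW = 100
    const int threadsY = divUp(outH, block) * block;

    int count = 0;
    for (int y = 0; y < threadsY; ++y) {
        for (int x = 0; x < threadsX; ++x) {
            // Same condition as the kernel's early return.
            const bool returnsEarly =
                x >= outW + xMin || y >= outH + yMin || x < xMin || y < yMin;
            if (!returnsEarly)
                ++count;
        }
    }
    return count;
}
```

Calling `survivorsOriginalGuard(100, 100, 200, 200)` gives 0, while `survivorsOriginalGuard(100, 100, 0, 0)` gives 10000. In other words, that guard only lets the full set of ROI threads through when the ROI is anchored at the origin.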

This kernel code may work better for you:

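template <typename T>
__global__ void gpuCrop( T* input, int inputWidth, T* output, int outputWidth, int outputHeight, int xMin, int yMin )
{
	// x,y are coordinates in the OUTPUT (cropped) image
	const int y = blockDim.y * blockIdx.y + threadIdx.y;
	const int x = blockDim.x * blockIdx.x + threadIdx.x;
	
	// discard the extra threads created by rounding the grid up to whole blocks
	if( x >= outputWidth || y >= outputHeight)
		return;

	// shift by the ROI anchor to find the matching pixel in the INPUT image
	const T px = input[(y+yMin) * inputWidth + (x+xMin)];

	output[y * outputWidth + x] = px;
}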
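
One thing this kernel does not check is whether the ROI actually fits inside the input. If xMin + outputWidth exceeds inputWidth (or the same in y), it reads past the intended rows or past the end of the buffer. Below is a rough host-side sketch, not part of jetson-inference or jetson-video, of a bounds check plus a CPU reference crop that uses the same indexing. You could copy the GPU result back and compare against it to confirm the crop is really happening. `Px`, `roiFits` and `cropReference` are hypothetical names, and `Px` is only a stand-in for `float4` so the sketch builds without the CUDA headers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for CUDA's float4 (an assumption made so this builds on the host only).
struct Px { float x, y, z, w; };

// True when the ROI lies entirely inside the input image.
bool roiFits(int inW, int inH, int outW, int outH, int xMin, int yMin)
{
    return outW > 0 && outH > 0 && xMin >= 0 && yMin >= 0 &&
           xMin + outW <= inW && yMin + outH <= inH;
}

// CPU reference crop with the same indexing as the kernel:
// output(x, y) = input(x + xMin, y + yMin).
std::vector<Px> cropReference(const std::vector<Px>& input, int inW, int inH,
                              int outW, int outH, int xMin, int yMin)
{
    assert(roiFits(inW, inH, outW, outH, xMin, yMin));
    assert(input.size() == static_cast<std::size_t>(inW) * inH);

    std::vector<Px> output(static_cast<std::size_t>(outW) * outH);
    for (int y = 0; y < outH; ++y)
        for (int x = 0; x < outW; ++x)
            output[y * outW + x] = input[(y + yMin) * inW + (x + xMin)];
    return output;
}
```

For the example above, `roiFits(1920, 1080, 100, 100, 200, 200)` is true, so the launch would be safe. An anchor such as xMin = 1900 with the same 100-pixel width would fail the check and should be rejected before launching the kernel.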