Processing batches, offset in output


I am trying to process batches with my custom Caffe model.
I downloaded this github repository and it works fine with a standard model like googlenet, but not with mine, so I may have set something up incorrectly.

Let me explain: when I infer with a batch of 1 image, the output is fine, but when I infer with a batch of 2 or more images, there is an offset of 18 in the array


, even though its length is correct.
I have 6 labels, so the offset is 3 times my number of labels, and I noticed a factor of 3 in the code of the function


which I tried to remove. It worked well, but googlenet didn’t work anymore.
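
To make the arithmetic concrete, the following is a minimal sketch, assuming the network output is a flat array holding one contiguous block of scores per image. The helper `scoresForImage` is hypothetical and not taken from the repository. Under that assumption, with 6 labels the scores of image k start at element 6 * k, so an offset of 18 is the same distance as 3 whole images.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the repository): returns the scores of
// image `index`, assuming the flat output holds numClasses floats per image.
std::vector<float> scoresForImage(const std::vector<float>& output,
                                  std::size_t numClasses, std::size_t index)
{
    const std::size_t begin = index * numClasses;

    // The requested slice must lie inside the output buffer.
    assert(begin + numClasses <= output.size());

    return std::vector<float>(output.data() + begin,
                              output.data() + begin + numClasses);
}
```

Calling `scoresForImage(output, 6, 3)` returns the 6 values starting at flat index 18, which is one way to check whether the shifted results are simply the scores of a different batch slot.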

So I assume that googlenet should work fine and that I am setting something wrong. I am basically using the source code of the github repo, except for the loading of the image, which I do using OpenCV.

My images are 80x80 and B&W.
googlenet images have different sizes and 3 or 4 color channels, but I process them with the same code.

My OpenCV loading source code:

// I get the filename(s) from the command line and do this for each one
cv::Mat image;
image = cv::imread(filename);
uchar* camData = new uchar[*4];
cv::Mat imgRGBA(image.size(), CV_8UC4, camData);
cv::cvtColor(image, imgRGBA, CV_BGR2RGBA, 4);
cv::Mat imgFloat;
imgRGBA.convertTo(imgFloat, CV_32F);
// Then I use imgFloat to set the image width, height and to set cpuPtr content like in the original loadImageRGBA function

Is this where I am doing something wrong?

I can provide more details, I just don’t know which ones would help right now.

Thanks in advance.


Does your use case only work on the custom branch?

If not, could you check whether the same issue occurs with our official jetson_inference sample?


Well, as far as I know, your official github does not allow batches of more than 1 image, does it?

But yes I tested on your official github with batches of 1 and it works as expected. I even tried to replace your file loadImage.cpp (and header) with mine and I got the same results, this is why I was not worried about my custom image loading file.

To clarify a little, I took the files required to build the imagenet-console target from Then I replaced the loadImage files with mine. I customized the CMake to fit my only target. Everything worked as expected.

Then, I learned about and wanted to integrate this feature in my custom repo, so I took the methods I needed from imageNet.h/cpp/cu and modified my imagenet-console.cpp to fit the new requirements. And I got this offset issue.

Could it be the image format? My custom model eats B&W, fixed-size images, while googlenet eats images of arbitrary size with 3 or 4 channels.
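
One way to reason about whether the format matters is to compare where each image of a batch starts in the input buffer. The sketch below assumes a planar float buffer laid out as batch, channel, row, column (NCHW); the helper `planarIndex` is made up for illustration and does not come from the repository.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: flat index of pixel (x, y) in channel c of image b,
// assuming a planar NCHW float buffer.
std::size_t planarIndex(std::size_t b, std::size_t c, std::size_t y, std::size_t x,
                        std::size_t channels, std::size_t height, std::size_t width)
{
    return ((b * channels + c) * height + y) * width + x;
}
```

For 80x80 images, the second image of a batch starts at `planarIndex(1, 0, 0, 0, 1, 80, 80)`, which is 6400, for a single-channel model, but at `planarIndex(1, 0, 0, 0, 3, 80, 80)`, which is 19200, for a 3-channel one. If the pre-processing writes with the 3-channel stride while the network reads with the 1-channel stride, only the first image of the batch would line up.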

Also, I noticed that this github does not free the CPU pointer after each image but after each batch. Is that relevant?

Because when I run this, I clearly see in the debug logs that for a batch size of n, the first n-1 images of the batch are not freed (the memory allocation of the next batch begins where the n-th image was).

Edit: I tried using my B&W images with googlenet, and the results are consistent between two different batch sizes. So there has to be something wrong with my custom model. What can I provide so you can help me?


Want to clarify first:
You get the expected results for gray images with jetson_inference but errors with the custom branch, is that correct?
If yes, it’s recommended to enable multiple batch support in jetson_inference:

Modify here:



Clarification: I get erroneous results when I use a batch size >=2 on the custom branch. (erroneous meaning with a weird offset)

I enabled batch support on jetson_inference.

The results for a given image are sometimes different when I use a different batch size (e.g. 1 and 2).
I always put as many images in a batch as the maximum batch size.

Is there anything else I need to modify?


I figured the solution out.

My model eats grayscale images, so I had to build a custom loadImageBW function and another CUDA function like cudaPreImageNetMean, but without the mean subtraction (the image is already pre-processed) and with adjustments when filling the output pointer.

See the CUDA code below (the code of loadImageBW is obvious, but I can provide it if needed).

// gpuPreImageNetBatchBW
__global__ void gpuPreImageNetBatchBW( int numBatch, float2 scale, float* input, int iWidth, float* output, int oWidth, int oHeight )
{
	const int x = blockIdx.x * blockDim.x + threadIdx.x;
	const int y = blockIdx.y * blockDim.y + threadIdx.y;
	const int n = oWidth * oHeight;

	// threads outside the output image have nothing to do
	if( x >= oWidth || y >= oHeight )
		return;

	// nearest source pixel for this output pixel
	const int dx = ((float)x * scale.x);
	const int dy = ((float)y * scale.y);

	// single channel: the image in batch slot numBatch starts at n * numBatch
	const float px  = input[ dy * iWidth + dx ];
	output[n * numBatch + y * oWidth + x] = px;
}

// cudaPreImageNetBatchBW
cudaError_t cudaPreImageNetBatchBW( int numBatch, float* input, size_t inputWidth, size_t inputHeight,
				             float* output, size_t outputWidth, size_t outputHeight )
{
	if( !input || !output )
		return cudaErrorInvalidDevicePointer;

	if( inputWidth == 0 || outputWidth == 0 || inputHeight == 0 || outputHeight == 0 )
		return cudaErrorInvalidValue;

	const float2 scale = make_float2( float(inputWidth) / float(outputWidth),
							    float(inputHeight) / float(outputHeight) );

	// launch kernel
	const dim3 blockDim(8, 8);
	const dim3 gridDim(iDivUp(outputWidth,blockDim.x), iDivUp(outputHeight,blockDim.y));

	gpuPreImageNetBatchBW<<<gridDim, blockDim>>>(numBatch, scale, input, inputWidth, output, outputWidth, outputHeight);

	return CUDA(cudaGetLastError());
}
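
For anyone who wants to check the indexing without a GPU, here is a minimal CPU sketch of the same per-pixel logic. It is not part of the repository: the name `preImageNetBatchBWRef` and the use of `std::vector` are assumptions made for illustration, and `batchIndex` plays the role of `numBatch` in the kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference sketch (hypothetical): copies one single-channel image into
// slot batchIndex of a batched output buffer, with nearest-pixel resampling.
// The output vector must already hold at least (batchIndex + 1) images.
void preImageNetBatchBWRef(std::size_t batchIndex,
                           const std::vector<float>& input,
                           std::size_t iWidth, std::size_t iHeight,
                           std::vector<float>& output,
                           std::size_t oWidth, std::size_t oHeight)
{
    const float scaleX = float(iWidth) / float(oWidth);
    const float scaleY = float(iHeight) / float(oHeight);
    const std::size_t n = oWidth * oHeight;  // elements per output image

    assert(input.size() >= iWidth * iHeight);
    assert(output.size() >= n * (batchIndex + 1));

    for (std::size_t y = 0; y < oHeight; ++y) {
        for (std::size_t x = 0; x < oWidth; ++x) {
            // Truncate towards zero, like the int conversion in the kernel.
            const std::size_t dx = static_cast<std::size_t>(float(x) * scaleX);
            const std::size_t dy = static_cast<std::size_t>(float(y) * scaleY);

            // One channel only, so there is no factor of 3 in the offset.
            output[n * batchIndex + y * oWidth + x] = input[dy * iWidth + dx];
        }
    }
}
```

Filling slot 1 of a two-image buffer with a 2x2 input should leave the first 4 output elements untouched and write the input into elements 4 to 7, which makes the per-image stride easy to verify against the GPU result.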

Thank you for your time!

Thanks for your feedback : )