Help with batch inputs in test in Python API

I use the TensorRT Python API and I can't find docs about batch inputs for Python. Based on the C++ docs, I tried to run some tests with TensorRT.

I trained a simple model in PyTorch and built an engine from it. The model input is (3,224,224) and the output is (2) (probabilities for 2 classes). The engine's max_batch_size is 8. TensorRT inference works fine for single-image inputs: I prepared a 1D buffer of size 3*224*224 for an image and got 2 probabilities for the two classes. Now I'm trying to run batch tests. I execute the engine with batch_size=4 and altered the sample buffer allocation:

h_input = pycuda.driver.pagelocked_empty(4*trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(ModelData.DTYPE))
h_output = pycuda.driver.pagelocked_empty(4*trt.volume(engine.get_binding_shape(1)), dtype=trt.nptype(ModelData.DTYPE))
d_input = pycuda.driver.mem_alloc(h_input.nbytes)
d_output = pycuda.driver.mem_alloc(h_output.nbytes)
stream = pycuda.driver.Stream()
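The buffer sizing above can be sanity-checked with plain numpy. This is a hypothetical sketch that hard-codes the shapes from the post ((3,224,224) in, (2,) out) instead of querying the engine; trt.volume is just the product of the binding's dimensions:

```python
import numpy as np

# Assumed shapes, standing in for engine.get_binding_shape(0) / (1)
batch_size = 4
input_shape = (3, 224, 224)
output_shape = (2,)

input_volume = int(np.prod(input_shape))    # elements per image, like trt.volume
output_volume = int(np.prod(output_shape))  # elements per result

h_input_size = batch_size * input_volume    # total elements in h_input
h_output_size = batch_size * output_volume  # total elements in h_output
```

With these shapes, h_input should hold 4*3*224*224 = 602112 elements and h_output 4*2 = 8; if the allocated buffers don't match that, the batch copy will silently truncate or fail.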

I prepared the images: resized them, transposed the channels, and put them into a numpy array (4,3,224,224). Then I prepared a 1D page-locked buffer of size 4*3*224*224 for the 4-image batch:

norm_images = norm_images.astype(trt.nptype(ModelData.DTYPE)).ravel()
np.copyto(h_input,norm_images)
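One thing worth checking here is the flattening itself: `ravel()` must be called (with parentheses, and spelled `astype`, not `as_type`), otherwise `np.copyto` receives a method object and fails. A minimal numpy-only sketch, using a plain ndarray in place of the page-locked buffer (same memory layout) and random data standing in for the real images:

```python
import numpy as np

# Stand-in for the preprocessed batch; the real data comes from resized,
# channel-transposed images.
norm_images = np.random.rand(4, 3, 224, 224).astype(np.float32)

# Stand-in for pycuda.driver.pagelocked_empty(...); layout is identical.
h_input = np.empty(4 * 3 * 224 * 224, dtype=np.float32)
np.copyto(h_input, norm_images.ravel())  # note: .ravel(), not .ravel

# Each image should occupy one contiguous 3*224*224 slice of the buffer.
per_image = 3 * 224 * 224
first_image = h_input[:per_image].reshape(3, 224, 224)
```

If each image's slice of h_input matches the corresponding norm_images[i], the host-side packing is correct and any batching problem lies on the engine side.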

Inference:

pycuda.driver.memcpy_htod_async(d_input,h_input,stream)
context.execute_async(batch_size=4,bindings=[int(d_input), int(d_output)],stream_handle=stream.handle)
pycuda.driver.memcpy_dtoh_async(h_output,d_output,stream)
stream.synchronize()
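After synchronization, h_output is a flat buffer of batch_size*2 values, so sample i's pair sits at h_output[2*i : 2*i + 2]. A small sketch of unpacking it (the values below are made up for illustration, not real inference output):

```python
import numpy as np

batch_size = 4

# Stand-in for the page-locked h_output after memcpy_dtoh_async; with a
# correctly batched run, each row would differ per image.
h_output = np.array([0.9, 0.1, 0.3, 0.7, 0.6, 0.4, 0.2, 0.8],
                    dtype=np.float32)

# Row i holds the two class probabilities for image i.
probs = h_output.reshape(batch_size, 2)
```

If every row of `probs` comes back identical (or all rows after the first are zeros), the unpacking above is not at fault; the batch likely never reached the engine correctly.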

After the test I got a list with 4 identical pairs of numbers, so I guess I'm doing something wrong. Are there any examples of batch tests in Python anywhere?

Have you solved the issue? I ran into a similar problem: when I change the batch size to more than 1, the outputs for every sample except the first are all zeros.