Where do frameIndex and videoIndex come from in the DeepStream detection sample?

I have several questions about DeepStream's internal workflow, at the layer above TensorRT but below the DeepStream user code in the technology stack.

1. Where do frameIndex and videoIndex come from in the DeepStream detection sample?
Here is the code in the parserModule_resnet18.h file:

bboxs.frameIndex = trace_0[iB].frameIndex;
bboxs.videoIndex = trace_0[iB].videoIndex;
bboxs.nBBox = 0;

Since the .prototxt file at the TensorRT level only knows the four-dimensional [BCHW] tensor, how can DeepStream retrieve the videoIndex from the inference module's output and route it into the parser module's input?

2. How does the parser module know that pCov and pBBOX are in CPU (host) memory?
I studied the code of both DeepStream's detection sample and TensorRT's Faster R-CNN sample. The code in DeepStream:

const float *pCov = reinterpret_cast<const float*>(vpInputTensors[0]->getConstCpuData());
std::vector<TRACE_INFO > trace_0 = vpInputTensors[0]->getTraceInfos();
const float *pBBOX = reinterpret_cast<const float*>(vpInputTensors[1]->getConstCpuData());

The code in TensorRT:

CHECK(cudaMalloc(&buffers[outputIndex0], batchSize * nmsMaxOut * OUTPUT_BBOX_SIZE * sizeof(float))); // bbox_pred
CHECK(cudaMalloc(&buffers[outputIndex1], batchSize * nmsMaxOut * OUTPUT_CLS_SIZE * sizeof(float)));  // cls_prob
CHECK(cudaMalloc(&buffers[outputIndex2], batchSize * nmsMaxOut * 4 * sizeof(float)));                // rois
CHECK(cudaMemcpyAsync(outputBboxPred, buffers[outputIndex0], batchSize * nmsMaxOut * OUTPUT_BBOX_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
CHECK(cudaMemcpyAsync(outputClsProb, buffers[outputIndex1], batchSize * nmsMaxOut * OUTPUT_CLS_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
CHECK(cudaMemcpyAsync(outputRois, buffers[outputIndex2], batchSize * nmsMaxOut * 4 * sizeof(float), cudaMemcpyDeviceToHost, stream));

In TensorRT, user code needs to copy memory between host and device, so why are these steps unnecessary in DeepStream?

3. How do I set up DeepStream if the inference module uses a different data layout?
We have a .prototxt network whose input is four-dimensional, [B*10*448*448], but the "channel" dimension is collapsed from BGR to GRAY; the channel no longer exists, and this dimension now means 10 frames. If B is 4, each input tensor needs 4*10 = 40 frames at the TensorRT level. Can DeepStream support this scenario? In short: what about producing only 1 output for every 10 frames fed to the inference engine?


1. Index information is set when decoding frames.

2. There is still a memcpy procedure, but it is handled by the DeepStream API.
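This answer can be pictured with a small stand-in model (hypothetical; `StreamTensor` here is not the real SDK type, and a plain `memcpy` stands in for the `cudaMemcpyAsync` plus stream synchronization the real API would perform). The framework owns both buffers, so by the time the parser calls `getConstCpuData()` the data is already in host memory:

```cpp
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical sketch of what a DeepStream stream-tensor wrapper could do:
// the framework owns both device and host buffers, and getConstCpuData()
// hands the parser an already-synchronized host copy.
class StreamTensor {
public:
    explicit StreamTensor(std::vector<float> deviceData)
        : device_(std::move(deviceData)) {}

    const float *getConstCpuData() {
        if (host_.empty()) {                 // copy device -> host once, lazily
            host_.resize(device_.size());
            std::memcpy(host_.data(), device_.data(),
                        device_.size() * sizeof(float));
        }
        return host_.data();
    }

private:
    std::vector<float> device_;  // stands in for GPU memory
    std::vector<float> host_;    // host-side copy handed to the parser
};
```

This is why the parser code above can cast `getConstCpuData()` straight to `const float*` without issuing any CUDA calls of its own.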

3. If your use case can be treated as 40 independent inputs (batch = 40), it should be possible to run inference with DeepStream and TensorRT with channel size = 1.
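The flattening suggested here can be sanity-checked with a few lines of index arithmetic (a sketch; the names are illustrative, not SDK API):

```cpp
// Treat a [B, 10, H, W] tensor of ten gray frames per sample as a
// [B*10, 1, H, W] batch: sample b's frame f becomes batch entry b*10 + f.
constexpr int kFramesPerSample = 10;

int flatBatchIndex(int sample, int frame) {
    return sample * kFramesPerSample + frame;
}
```

With B = 4 this yields an effective TensorRT batch of 40, matching the frame count described in question 3.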


Thanks for your reply. I want to discuss your idea about the input data layout with DeepStream.
3. The .prototxt supports batched input data, so the input layer should be:

name: "SomeNet"
layer {
  name: "data"
  type: "Input"
  top: "data"
  input_param { shape: { dim: 40 dim: 1 dim: 448 dim: 448 } }
}

and the output layer can be

prob (Softmax)
blob shapes:
  prob: [ 4, 1000 ]

For a classifier network, every 40 input frames will produce 4 outputs of 1000 classes each. But how can we inject our custom operation to transform images from [40, 1, 448, 448] to [4, 1, 448, 224] (every ten 448x448 images produce one 448x224 image), given that DeepStream hides the setPluginFactory function exported by CaffeParser? Do you think it is possible to keep the DeepStream inference network input as [4, 1, 448, 224], but call addCustomerTask to add a custom module before the inference task?

I discussed the DeepStream input data layout with my team leader, and the description in my previous reply was not correct, so I am adding this reply to make things clear.
Since DeepStream handles the FRAME POOL and custom modules are added to the flexible/analysis pipeline, what we want is for DeepStream to feed 10 frames (or a multiple of 10) into the flexible pipeline in each epoch. The pseudocode might be:

OpenCVModule : public IModule {
  // input:  [40, 1, 448, 448]
  // output: [4, 20, 448, 224]
};

main() {
  IDeviceWorker *pDeviceWorker = createDeviceWorker(...);
  IModule *pConvertor = pDeviceWorker->addColorSpaceConvertorTask(BGR_PLANAR);

  // Add some OpenCV operations as a custom module
  PRE_MODULE_LIST preModules_cv;
  preModules_cv.push_back(std::make_pair(pConvertor, 0)); // BGR_PLANAR
  OpenCVModule *pCV = new OpenCVModule(preModules_cv, ...);

  // Add inference task, fed by the custom module
  IModule *pInfer = pDeviceWorker->addInferenceTask(std::make_pair(pCV, 0),
    nullptr, // meanFile
    ...);

  // Detection parser
  PRE_MODULE_LIST preModules_parser;
  preModules_parser.push_back(std::make_pair(pInfer, 0)); // prob: [4, 1000]
  ParserModule *pParser = new ParserModule(preModules_parser, ...);
  assert(nullptr != pParser);
}

4. Is it possible for DeepStream to work with this analysis pipeline?
5. Does OpenCVModule::execute need to clone the TRACE_INFO from the input stream tensor to the output stream tensor by calling setTraceInfo, to maintain the frame and video index information?
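For what it's worth, question 5 could look roughly like this. This is a sketch with stand-in types: TRACE_INFO, getTraceInfos, and setTraceInfo mirror the names used in the detection sample, but the structs here are simplified, not the real SDK types:

```cpp
#include <vector>

// Simplified stand-in for the trace info carried through the pipeline.
struct TRACE_INFO {
    int frameIndex;
    int videoIndex;
};

// Simplified stand-in for a stream tensor that carries trace info.
struct Tensor {
    std::vector<TRACE_INFO> traces;
    std::vector<TRACE_INFO> getTraceInfos() const { return traces; }
    void setTraceInfo(const std::vector<TRACE_INFO> &t) { traces = t; }
};

// The custom module transforms only the pixel data; the frame/video
// indices are cloned from input to output so the parser downstream
// can still recover them.
void executeCustomModule(const Tensor &input, Tensor &output) {
    output.setTraceInfo(input.getTraceInfos());
}
```

The point is simply that whatever the pixel transform does, the indices must pass through untouched.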


I guess there is some misunderstanding between us.

1. If your workflow is [B*10*448*448] -> [B*1000], the suggestion in reply #2 is not appropriate for your use case.
That suggestion assumes the batched images run independently, with no cross-batch computation, so its output would be [10B*1000].

2. Dynamic input is not supported by TensorRT. We are checking the possibility, but there is no concrete schedule.

3. YES


The channel number in the [BCHW] layout is ten, and it is constructed from ten gray images outside the scope of TensorRT or DeepStream. TensorRT can ignore the meaning of "C", but DeepStream needs to fill "C" with the RGB color space and drive the analysis pipeline with a dynamic frame number.
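A minimal sketch of that construction, assuming the ten gray frames are simply stacked along the channel axis of a CHW buffer (sizes illustrative, no SDK types involved):

```cpp
#include <vector>

// Stack ten HxW gray frames along the channel axis of a CHW buffer,
// so "C" = 10 carries frames rather than color planes.
std::vector<float> packGrayFrames(const std::vector<std::vector<float>> &frames,
                                  int h, int w) {
    std::vector<float> chw;
    chw.reserve(frames.size() * h * w);
    for (const auto &f : frames) {   // channel c holds frame c
        chw.insert(chw.end(), f.begin(), f.end());
    }
    return chw;
}
```

This is the step that would have to happen outside DeepStream/TensorRT, before the tensor is handed to the inference engine.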


Sounds cool!

A quick try with the MNIST network:
Setting the input shape to 10x28x28, TensorRT can run inference on it correctly.
It looks like this idea is workable.
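For reference, that quick try amounts to changing the channel dimension in the MNIST deploy prototxt at network-definition time (a sketch, assuming the standard Caffe input block):

```
input: "data"
input_shape {
  dim: 1
  dim: 10
  dim: 28
  dim: 28
}
```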

Feel free to let us know if you have further updates on this.

Sorry for the late reply. I studied the MNIST example of TensorRT, and only the batch dimension is changeable at runtime:

input: "data"
input_shape {
  dim: 1
  dim: 1
  dim: 28
  dim: 28
}
CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));
context.enqueue(batchSize, buffers, stream, nullptr);
CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE*sizeof(float), cudaMemcpyDeviceToHost, stream));

Hi haifengli,

I'm dealing with the same problem now. Could you send me your user-defined "OpenCVModule" module?