Where do frameIndex and videoIndex come from in the DeepStream detection sample?

I have several questions about DeepStream's internal workflow, at the layer above TensorRT but below the DeepStream user code in the technology stack.

1. Where do frameIndex and videoIndex come from in the DeepStream detection sample?
Here is the code in the parserModule_resnet18.h file:

bboxs.frameIndex = trace_0[iB].frameIndex;
bboxs.videoIndex = trace_0[iB].videoIndex;
bboxs.nBBox = 0;

Since the .prototxt file at the TensorRT level only knows the four-dimensional [BCHW] tensor, how can DeepStream retrieve the videoIndex from the inference module's output and route it into the parser module's input?

2. How does the parser module know that pCov and pBBOX are in CPU (host) memory?
I studied the code of both DeepStream's detection sample and TensorRT's Faster R-CNN sample. The code in DeepStream:

const float *pCov = reinterpret_cast<const float*>(vpInputTensors[0]->getConstCpuData());
std::vector<TRACE_INFO > trace_0 = vpInputTensors[0]->getTraceInfos();
const float *pBBOX = reinterpret_cast<const float*>(vpInputTensors[1]->getConstCpuData());

The code in TensorRT:

CHECK(cudaMalloc(&buffers[outputIndex0], batchSize * nmsMaxOut * OUTPUT_BBOX_SIZE * sizeof(float))); // bbox_pred
CHECK(cudaMalloc(&buffers[outputIndex1], batchSize * nmsMaxOut * OUTPUT_CLS_SIZE * sizeof(float)));  // cls_prob
CHECK(cudaMalloc(&buffers[outputIndex2], batchSize * nmsMaxOut * 4 * sizeof(float)));                // rois
CHECK(cudaMemcpyAsync(outputBboxPred, buffers[outputIndex0], batchSize * nmsMaxOut * OUTPUT_BBOX_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
CHECK(cudaMemcpyAsync(outputClsProb, buffers[outputIndex1], batchSize * nmsMaxOut * OUTPUT_CLS_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
CHECK(cudaMemcpyAsync(outputRois, buffers[outputIndex2], batchSize * nmsMaxOut * 4 * sizeof(float), cudaMemcpyDeviceToHost, stream));

In TensorRT, user code needs to copy memory between host and device, so why are these steps unnecessary in DeepStream?

3. How do I set up DeepStream if the inference module uses a different data layout?
We have a .prototxt network whose input is four-dimensional, [B*10*448*448], but the "channel" dimension is collapsed from BGR to GRAY; the channel no longer exists, and this dimension now means 10 frames. If B is 4, each input tensor needs 4*10 = 40 frames at the TensorRT level. Can DeepStream support this scenario? In short: what about producing only 1 output for every 10 frames fed to the inference engine?


1. Index information is set when decoding frames.

2. There is still a memcpy procedure, but it is handled by the DeepStream API.
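This answer can be pictured with a small stand-in model (hypothetical; `StreamTensor` here is not the real SDK type, and a plain `memcpy` stands in for the `cudaMemcpyAsync` plus stream synchronization the real API would perform). The framework owns both buffers, so by the time the parser calls `getConstCpuData()` the data is already in host memory:

```cpp
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical sketch of what a DeepStream stream-tensor wrapper could do:
// the framework owns both device and host buffers, and getConstCpuData()
// hands the parser an already-synchronized host copy.
class StreamTensor {
public:
    explicit StreamTensor(std::vector<float> deviceData)
        : device_(std::move(deviceData)) {}

    const float *getConstCpuData() {
        if (host_.empty()) {                 // copy device -> host once, lazily
            host_.resize(device_.size());
            std::memcpy(host_.data(), device_.data(),
                        device_.size() * sizeof(float));
        }
        return host_.data();
    }

private:
    std::vector<float> device_;  // stands in for GPU memory
    std::vector<float> host_;    // host-side copy handed to the parser
};
```

This is why the parser code above can cast `getConstCpuData()` straight to `const float*` without issuing any CUDA calls of its own.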

3. If your use case can be treated as 40 independent inputs (batch = 40), it should be possible to run inference with DeepStream and TensorRT with channel size = 1.
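The flattening suggested here can be sanity-checked with a few lines of index arithmetic (a sketch; the names are illustrative, not SDK API):

```cpp
// Treat a [B, 10, H, W] tensor of ten gray frames per sample as a
// [B*10, 1, H, W] batch: sample b's frame f becomes batch entry b*10 + f.
constexpr int kFramesPerSample = 10;

int flatBatchIndex(int sample, int frame) {
    return sample * kFramesPerSample + frame;
}
```

With B = 4 this yields an effective TensorRT batch of 40, matching the frame count described in question 3.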


Thanks for your reply. I want to discuss your idea about the input data layout with DeepStream.
3. The .prototxt supports batched input data, so the input layer should be:

name: "SomeNet"
layer {
  name: "data"
  type: "Input"
  top: "data"
  input_param { shape: { dim: 40 dim: 1 dim: 448 dim: 448 } }
}

and the output layer can be

prob (Softmax)
blob shapes:
  prob: [ 4, 1000 ]

For a classifier network, every 40 input frames will produce 4 outputs of 1000 classes each. But how can we inject our custom operation to transform images from [40, 1, 448, 448] to [4, 1, 448, 224] (every ten 448x448 images produce one 448x224 image), given that DeepStream hides the setPluginFactory function exported by CaffeParser? Do you think it is possible to keep the DeepStream inference network input as [4, 1, 448, 224], but call addCustomerTask to add a custom module before the inference task?

I discussed the DeepStream input data layout with my team leader, and the description in my previous reply was not correct, so I am adding this reply to make things clear.
Since DeepStream handles the FRAME POOL and custom modules are added to the flexible/analysis pipeline, what we want is for DeepStream to feed 10 frames (or a multiple of 10) into the flexible pipeline in each epoch. The pseudocode might be:

OpenCVModule : public IModule {
  // input:  [40, 1, 448, 448]
  // output: [4, 20, 448, 224]
};

main() {
  IDeviceWorker *pDeviceWorker = createDeviceWorker(...);
  IModule *pConvertor = pDeviceWorker->addColorSpaceConvertorTask(BGR_PLANAR);

  // Add some OpenCV operations as a custom module
  PRE_MODULE_LIST preModules_cv;
  preModules_cv.push_back(std::make_pair(pConvertor, 0)); // BGR_PLANAR
  OpenCVModule *pCV = new OpenCVModule(preModules_cv, ...);

  // Add inference task, fed by the custom module
  IModule *pInfer = pDeviceWorker->addInferenceTask(std::make_pair(pCV, 0),
    nullptr, // meanFile
    ...);

  // Detection parser
  PRE_MODULE_LIST preModules_parser;
  preModules_parser.push_back(std::make_pair(pInfer, 0)); // prob: [4, 1000]
  ParserModule *pParser = new ParserModule(preModules_parser, ...);
  assert(nullptr != pParser);
}

4. Is it possible for DeepStream to work with this analysis pipeline?
5. Does OpenCVModule::execute need to clone the TRACE_INFO from the input stream tensor to the output stream tensor by calling setTraceInfo, to maintain the frame and video index information?
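For what it's worth, question 5 could look roughly like this. This is a sketch with stand-in types: TRACE_INFO, getTraceInfos, and setTraceInfo mirror the names used in the detection sample, but the structs here are simplified, not the real SDK types:

```cpp
#include <vector>

// Simplified stand-in for the trace info carried through the pipeline.
struct TRACE_INFO {
    int frameIndex;
    int videoIndex;
};

// Simplified stand-in for a stream tensor that carries trace info.
struct Tensor {
    std::vector<TRACE_INFO> traces;
    std::vector<TRACE_INFO> getTraceInfos() const { return traces; }
    void setTraceInfo(const std::vector<TRACE_INFO> &t) { traces = t; }
};

// The custom module transforms only the pixel data; the frame/video
// indices are cloned from input to output so the parser downstream
// can still recover them.
void executeCustomModule(const Tensor &input, Tensor &output) {
    output.setTraceInfo(input.getTraceInfos());
}
```

The point is simply that whatever the pixel transform does, the indices must pass through untouched.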


I guess there is some misunderstanding between us.

1. If your workflow is [B*10*448*448] -> [B*1000], the suggestion in reply #2 is not appropriate for your use case.
That suggestion assumes the batched images run independently, with no cross-batch computation, so its output would be [10B*1000].

2. Dynamic input is not supported by TensorRT. We are checking the possibility, but there is no concrete schedule.

3. YES


The channel number in the [BCHW] layout is ten, and it is constructed from ten gray images outside the scope of TensorRT or DeepStream. TensorRT can ignore the meaning of "C", but DeepStream needs to fill "C" with the RGB color space and drive the analysis pipeline with a dynamic frame number.
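A minimal sketch of that construction, assuming the ten gray frames are simply stacked along the channel axis of a CHW buffer (sizes illustrative, no SDK types involved):

```cpp
#include <vector>

// Stack ten HxW gray frames along the channel axis of a CHW buffer,
// so "C" = 10 carries frames rather than color planes.
std::vector<float> packGrayFrames(const std::vector<std::vector<float>> &frames,
                                  int h, int w) {
    std::vector<float> chw;
    chw.reserve(frames.size() * h * w);
    for (const auto &f : frames) {   // channel c holds frame c
        chw.insert(chw.end(), f.begin(), f.end());
    }
    return chw;
}
```

This is the step that would have to happen outside DeepStream/TensorRT, before the tensor is handed to the inference engine.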


Sounds cool!

A quick try with the MNIST network:
Setting the input shape to 10x28x28, TensorRT can run inference on it correctly.
It looks like this idea is workable.
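For reference, that quick try amounts to changing the channel dimension in the MNIST deploy prototxt at network-definition time (a sketch, assuming the standard Caffe input block):

```
input: "data"
input_shape {
  dim: 1
  dim: 10
  dim: 28
  dim: 28
}
```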

Feel free to let us know if you have further updates on this.

Sorry for the late reply. I studied the MNIST example of TensorRT, and only the batch dimension is changeable at runtime:

input: "data"
input_shape {
  dim: 1
  dim: 1
  dim: 28
  dim: 28
}
CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));
context.enqueue(batchSize, buffers, stream, nullptr);
CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE*sizeof(float), cudaMemcpyDeviceToHost, stream));

Hi haifengli,

I'm dealing with the same problem now. Could you send me your user-defined "OpenCVModule" module?