TensorRT memory copy

Hi, I am using TensorRT on a Jetson TX2 for live video parsing. For the memory copy, I compared two approaches: CUDA zero copy and cudaMemcpy. The model I use is GoogLeNet and the data type is FP32.

  1. With CUDA zero copy, I call
     context.execute(1, buffers);
     which takes about 20 ms.
  2. With cudaMemcpy, the time I measure includes cudaMalloc, cudaMemcpy, and context.execute(1, buffers); in total this takes 16 ms.

Is there any problem here? Shouldn't CUDA zero copy be faster than cudaMemcpy? (A sketch of both setups is below.)
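
Roughly, the two setups look like this (a sketch, not my exact code; inputSize and the output buffer are placeholders):

    #include <NvInfer.h>
    #include <cuda_runtime.h>

    // Sketch of the two input-buffer setups being compared (placeholder sizes and output buffer).
    void compareInputBuffers(nvinfer1::IExecutionContext& context, size_t inputSize, void* outputDev)
    {
        // (1) CUDA zero copy: mapped host memory; the engine reads it directly
        //     through the TX2's shared DRAM, with no explicit host-to-device copy.
        float* inputHost = nullptr;
        float* inputDev  = nullptr;
        cudaHostAlloc((void**)&inputHost, inputSize, cudaHostAllocMapped);
        cudaHostGetDevicePointer((void**)&inputDev, inputHost, 0);
        // ... fill inputHost with the preprocessed image ...
        void* buffersZeroCopy[] = { inputDev, outputDev };
        context.execute(1, buffersZeroCopy);

        // (2) cudaMalloc + cudaMemcpy: explicit device buffer, copy the input in first.
        float* inputCopy = nullptr;
        cudaMalloc((void**)&inputCopy, inputSize);
        cudaMemcpy(inputCopy, inputHost, inputSize, cudaMemcpyHostToDevice);
        void* buffersCopy[] = { inputCopy, outputDev };
        context.execute(1, buffersCopy);

        cudaFree(inputCopy);
        cudaFreeHost(inputHost);
    }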

I also timed it with caffe-cudnn5, and it takes 21 ms, which is almost the same as TensorRT with CUDA zero copy.
So could you please give me some advice on memory copies with TensorRT for live video parsing?
Thanks.

Hi maoxiuping, how are you timing the events? Also, are you performing any device synchronization?
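
If you are using a host-side timer, note that it only measures asynchronously launched GPU work correctly if the device is synchronized before the timer is stopped; something along these lines (rough sketch):

    #include <chrono>
    #include <cuda_runtime.h>

    // Rough sketch: synchronize the device before stopping a CPU timer, otherwise
    // asynchronously launched GPU work may still be running when the clock is read.
    auto t_start = std::chrono::high_resolution_clock::now();
    // ... launch kernels / run inference ...
    cudaDeviceSynchronize();   // wait for all queued GPU work to finish
    auto t_end = std::chrono::high_resolution_clock::now();
    float ms = std::chrono::duration<float, std::milli>(t_end - t_start).count();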

Normally for live video processing with a DNN, the network (an AlexNet or GoogLeNet derivative, for example) expects planar BGR format, but the video input format is YUV/NV12/RGBA/etc., so a colorspace conversion kernel is required.

The input to the colorspace conversion kernel can use zero copy, since the shared memory is beneficial when working with the video capture device on the CPU, but the output of the colorspace conversion should be device global memory allocated with cudaMalloc(). That buffer in turn gets fed into TensorRT.
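
Roughly like this (the names and sizes are only illustrative, not the exact jetson-inference code):

    #include <cuda_runtime.h>

    // Illustrative buffer layout: zero copy for the capture frame the CPU writes,
    // plain device memory for the planar-BGR tensor that TensorRT reads.
    // camWidth/camHeight/netWidth/netHeight are placeholders for the camera and network dimensions.
    uchar4* captureHost = nullptr;   // written by the video capture path on the CPU
    uchar4* captureDev  = nullptr;   // the same memory, as seen from the GPU
    cudaHostAlloc((void**)&captureHost, camWidth * camHeight * sizeof(uchar4), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&captureDev, captureHost, 0);

    float* trtInput = nullptr;       // colorspace-conversion output / TensorRT input
    cudaMalloc((void**)&trtInput, 3 * netWidth * netHeight * sizeof(float));

    // colorspaceConversionKernel<<<grid, block>>>(captureDev, trtInput, ...);   // placeholder kernel
    // context->execute(batchSize, buffers);                                     // buffers[0] == trtInput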

The code I use comes from https://github.com/dusty-nv/jetson-inference
In imageNet.cpp, in the Classify() function, I added this:

    // downsample and convert to band-sequential BGR
    if( CUDA_FAILED(cudaPreImageNetMean((float4*)rgba, width, height, mInputCUDA, mWidth, mHeight,
                                        make_float3(104.0069879317889f, 116.66876761696767f, 122.6789143406786f))) )
    {
        printf("imageNet::Classify() -- cudaPreImageNetMean failed\n");
        return -1;
    }

    void* inferenceBuffers[] = { mInputCUDA, mOutputs[0].CUDA };

    // time the forward pass with a host-side timer
    auto t_start = std::chrono::high_resolution_clock::now();

    mContext->execute(1, inferenceBuffers);

    auto t_end = std::chrono::high_resolution_clock::now();
    const float forward_ms = std::chrono::duration<float, std::milli>(t_end - t_start).count();
    std::cout << "forward run time is " << forward_ms << " ms." << std::endl;

With GoogLeNet, forward_ms is about 20 ms.
The video input format is RGB and has already been converted to BGR. I think the input (rgba) to the colorspace conversion kernel is device global memory allocated with cudaMalloc(), and the output of the colorspace conversion, mInputCUDA, uses zero copy in the code above.

Instead of a CPU timer, you should probably try cudaEvents and cudaEventElapsedTime() here: http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EVENT.html#axzz4lUvuzOyi

In my repo I should probably make it so that the video capture input buffer is allocated with the zeroCopy flag, but the further intermediate buffers are not.

I tried cudaEvents in gie_samples/samples/giexec.cpp

for (int j = 0; j < gParams.iterations; j++)
{
    float total = 0, ms;
    for (int i = 0; i < gParams.avgRuns; i++)
    {
        if (gParams.hostTime)
        {
            auto t_start = std::chrono::high_resolution_clock::now();
            context->execute(gParams.batchSize, &buffers[0]);
            auto t_end = std::chrono::high_resolution_clock::now();
            ms = std::chrono::duration<float, std::milli>(t_end - t_start).count();
        }
        else
        {
            cudaEventRecord(start, stream);
            context->enqueue(gParams.batchSize, &buffers[0], stream, nullptr);
            cudaEventRecord(end, stream);
            cudaEventSynchronize(end);
            cudaEventElapsedTime(&ms, start, end);
        }
        total += ms;
    }
    total /= gParams.avgRuns;
    std::cout << "Average over " << gParams.avgRuns << " runs is " << total << " ms." << std::endl;
}

I tested with batch size = 1; there is nearly no difference between hostTime and cudaEvents.

OK, that may be because the non-asynchronous version of TensorRT enqueue() was used. In the future I will try changing the repo to not use zero-copy memory for intermediate buffers. You could try changing it yourself for the time being: replace the buffer that is the output of the color conversion (i.e. the input to TensorRT) with memory allocated by cudaMalloc().
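
Something along these lines where the network input is allocated (a sketch; the exact member and helper names in the repo may differ slightly):

    // Before (zero copy): the TensorRT input buffer was allocated as mapped host/device
    // memory, roughly:  cudaAllocMapped((void**)&mInputCPU, (void**)&mInputCUDA, inputSize);
    //
    // After: allocate the TensorRT input as plain device global memory instead.
    if( CUDA_FAILED(cudaMalloc((void**)&mInputCUDA, inputSize)) )
    {
        printf("failed to allocate CUDA device memory for the network input\n");
        return false;
    }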

Yes, I have replaced the buffer for the output of the color conversion with memory allocated by cudaMalloc(). For GoogLeNet, the forward time is reduced from 20 ms to 15 ms; however, for segmentation with an FCN network, the forward time is still almost the same. Does TensorRT optimize especially for GoogLeNet?

No, but it might turn out that a layer GoogLeNet uses has a memory access pattern that is not ideal for zero copy (FCN-Alexnet for segmentation is based on AlexNet as opposed to GoogLeNet). The imagenet-console example prints out layer timings, so you could see if a specific layer is slower. See imagenet-console.cpp:42 for how to enable it:

net->EnableProfiler();