Using TensorRT to accelerate a Caffe model, but inference takes more time

I use TensorRT to accelerate two Caffe models (cifar10 and bvlc_inference_caffenet) and compare it with the Caffe Python and C++ interfaces. The results are below:
For cifar10:
Caffe C++: 0.776791 ms
Caffe Python: 1.068496 ms
TensorRT: 0.342586 ms

For bvlc:
Caffe C++: 4.97181 ms
Caffe Python: 13.28289 ms
TensorRT: 11.1584 ms

My question is: why is TensorRT slower than Caffe C++ for the bvlc model? My TensorRT timing code is:

for(int i=0;i<10;++i)
        {
        CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_H * INPUT_W * INPUT_C * sizeof(float), 
        cudaMemcpyHostToDevice, stream));
        context.enqueue(batchSize, buffers, stream, nullptr);
        CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE*sizeof(float), 
         cudaMemcpyDeviceToHost, stream));
        cudaStreamSynchronize(stream);
        }
        auto t_end = std::chrono::high_resolution_clock::now();
        auto ms = std::chrono::duration<float, std::milli>(t_end - t_start).count();
        std::cout<<"time duration is : "<< ms/10.0 << " ms"<<std::endl;

and my Caffe C++ timing code is:

struct timeval t_start, t_end;
gettimeofday(&t_start, NULL);
for (int i = 0; i < 10; ++i) {
    net_->Forward();  // full forward pass over the already-filled input blob
}
gettimeofday(&t_end, NULL);
// elapsed wall-clock time in milliseconds
double ms = (t_end.tv_sec * 1000 + t_end.tv_usec / 1000.0) - (t_start.tv_sec * 1000 + t_start.tv_usec / 1000.0);
std::cout << "average time using is : " << ms / 10.0 << "ms" << std::endl;

and my Caffe Python timing code is:

import time

start_time = time.time()
for i in range(10):
    out = net.forward()
end_time = time.time()
# total seconds * 1000 -> ms, divided by 10 runs -> average per inference
print "inference time is: {} ms".format((end_time - start_time) * 1000 / 10)

Am I using TensorRT incorrectly?
PS: my TensorRT code is modified from sampleMNIST, and my C++ code is modified from caffe/examples/cpp_classification.

Can anyone help? Hoping for a reply!

Hi,

From your source code, the measured TensorRT duration includes:

  1. Memory copy from host to device
  2. TensorRT inference
  3. Memory copy from device to host

which is quite different from what the Caffe measurement covers.

To measure the same inference-only time as Caffe, it's recommended to update your sample like this:

CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_H * INPUT_W * INPUT_C * sizeof(float), cudaMemcpyHostToDevice, stream));

auto t_start = std::chrono::high_resolution_clock::now();
for(int i=0;i<10;++i)
{
    context.enqueue(batchSize, buffers, stream, nullptr);
}
auto t_end = std::chrono::high_resolution_clock::now();

auto ms = std::chrono::duration<float, std::milli>(t_end - t_start).count();
std::cout<<"time duration is : "<< ms/10.0 << " ms"<<std::endl;

CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE*sizeof(float), cudaMemcpyDeviceToHost, stream));
cudaStreamSynchronize(stream);

Thanks.

Thanks! I tried it and got an average time of 0.35 ms, 10x faster than Caffe C++. However, I doubt the accuracy of this test, because one thing bothers me:

The Caffe C++ inference call is net_->Forward();
the Caffe Python inference call is net.forward().
Is there no data movement between device and host inside these calls?

Hoping for your reply~

Does anybody know?

Hi,

It depends on which computing mode you use.

For example, with caffe.set_mode_cpu():
the data is written directly into the network's input buffer during normalization:
caffe/classification.cpp at master · BVLC/caffe · GitHub

because the normalization output buffer and the inference input buffer are bound together at initialization:
caffe/classification.cpp at master · BVLC/caffe · GitHub
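
As a rough illustration of that binding (only a sketch of the pattern in classification.cpp; names roughly follow that example and the surrounding setup is omitted):

// The cv::Mat wrappers created at initialization point directly into the
// network's input blob, so the preprocessing step writes the normalized
// image into the blob in place; in CPU mode there is no extra input copy.
caffe::Blob<float>* input_layer = net_->input_blobs()[0];
float* input_data = input_layer->mutable_cpu_data();

std::vector<cv::Mat> input_channels;
for (int i = 0; i < input_layer->channels(); ++i) {
    cv::Mat channel(input_layer->height(), input_layer->width(), CV_32FC1, input_data);
    input_channels.push_back(channel);
    input_data += input_layer->height() * input_layer->width();
}

// cv::split() writes the resized, mean-subtracted image straight into those
// wrappers, i.e. straight into the input blob that net_->Forward() reads:
cv::split(sample_normalized, input_channels);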

Thanks.