I used TensorRT to accelerate two Caffe models (cifar10 and bvlc_reference_caffenet) and compared it with the Caffe C++ and Python interfaces. The results are below:
For cifar10:
C++: 0.776791 ms
Python: 1.068496 ms
TensorRT: 0.342586 ms
For bvlc_reference_caffenet:
C++: 4.97181 ms
Python: 13.28289 ms
TensorRT: 11.1584 ms
My question is: why is TensorRT slower than C++ for the bvlc model? My TensorRT timing code is:
for(int i=0;i<10;++i)
{
CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_H * INPUT_W * INPUT_C * sizeof(float),
cudaMemcpyHostToDevice, stream));
context.enqueue(batchSize, buffers, stream, nullptr);
CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE*sizeof(float),
cudaMemcpyDeviceToHost, stream));
cudaStreamSynchronize(stream);
}
auto t_end = std::chrono::high_resolution_clock::now();
auto ms = std::chrono::duration<float, std::milli>(t_end - t_start).count();
std::cout<<"time duration is : "<< ms/10.0 << " ms"<<std::endl;
and my Caffe C++ timing code is:
struct timeval t_start, t_end;
gettimeofday(&t_start, NULL);
for (int i = 0; i < 10; ++i) {
    net_->Forward();
}
gettimeofday(&t_end, NULL);
double ms = t_end.tv_sec * 1000 + t_end.tv_usec / 1000.0 - t_start.tv_sec * 1000 - t_start.tv_usec / 1000.0;
std::cout << "average time using is : " << ms / 10.0 << "ms" << std::endl;
and my Python timing code is:
start_time = time.time()
for i in range(10):
    out = net.forward()
end_time = time.time()
# seconds * 1000 -> ms, divided by 10 iterations -> average per inference
print "average inference time is: {} ms".format((end_time - start_time) * 1000 / 10)
Am I using TensorRT incorrectly???
PS: my TensorRT code is modified from sampleMNIST, and my C++ code is modified from caffe/examples/cpp_classification.
Can anyone help? Thanks!
Hi,
From your source code, the duration for TensorRT includes:
1. Memory copy from host to device
2. TensorRT inference
3. Memory copy from device to host
This is quite different from the Caffe measurement, which times only the forward pass.
To measure the same inference time, it's recommended to update your sample like this:
// copy the input from host to device once, outside the timed loop
CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_H * INPUT_W * INPUT_C * sizeof(float), cudaMemcpyHostToDevice, stream));
auto t_start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 10; ++i)
{
    context.enqueue(batchSize, buffers, stream, nullptr);
}
auto t_end = std::chrono::high_resolution_clock::now();
auto ms = std::chrono::duration<float, std::milli>(t_end - t_start).count();
std::cout << "time duration is : " << ms / 10.0 << " ms" << std::endl;
// copy the output back after timing
CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
cudaStreamSynchronize(stream);
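One caveat: enqueue() only queues the work on the stream and returns immediately, so if you want the interval to include the GPU execution itself and not just the kernel launches, you can synchronize the stream before reading t_end; a minimal variant:

auto t_start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < 10; ++i)
{
    context.enqueue(batchSize, buffers, stream, nullptr);
}
cudaStreamSynchronize(stream);  // wait for all queued work to finish
auto t_end = std::chrono::high_resolution_clock::now();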
Thanks.
Thanks! I tried it and got an average time of 0.35 ms, about 10x faster than C++. But I doubt the accuracy of the test, because one thing bothers me:
the Caffe C++ inference: net_->Forward();
the Caffe Python inference: net.forward();
Is there no data movement between device and host inside these calls?
Hope for your reply~
Hi,
It depends on which computing mode you use.
For example, with caffe.set_mode_cpu():
Data is written to the input buffer during normalization:
caffe/classification.cpp at master · BVLC/caffe · GitHub
since the buffer for the normalization output and the buffer for the inference input are bound to each other at initialization:
caffe/classification.cpp at master · BVLC/caffe · GitHub
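Roughly, the classification example does something like the following (a simplified sketch of the WrapInputLayer/Preprocess logic from that file; normalized_img is a placeholder name for the preprocessed cv::Mat):

// The cv::Mat channels alias the input blob's memory, so writing the
// preprocessed image writes directly into the buffer Forward() reads from.
caffe::Blob<float>* input_layer = net_->input_blobs()[0];
float* input_data = input_layer->mutable_cpu_data();
std::vector<cv::Mat> input_channels;
for (int i = 0; i < input_layer->channels(); ++i) {
    cv::Mat channel(input_layer->height(), input_layer->width(), CV_32FC1, input_data);
    input_channels.push_back(channel);
    input_data += input_layer->width() * input_layer->height();
}
// after resizing and mean subtraction:
cv::split(normalized_img, input_channels);  // fills the blob; no extra copy into the net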
Thanks.