Is there any clearer official doc of customizing plugins in TensorRT?

posutsai · June 18, 2019, 5:53am

The followings are the source I have already check.

Official sample SampleUffSSD and SamplePlugin
TensorRT Doc https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/c_api/_nv_infer_8h_source.html
Nvidia Developer Guide https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#ipluginext

In my case, I am trying to implement tile and sign operations. Those two operation functions are just like in Tensorflow.

tf.tile https://www.tensorflow.org/api_docs/python/tf/tile
tf.math.sign https://www.tensorflow.org/api_docs/python/tf/tile

I know that I should implement two classes that inherit from IPluginV2 and IPluginCreator respectively and there is a list of virtual function to be implemented according to interfaces. However, the only clue I know what to fill in is the function name. For example, I would like to know where I should do the computation and when the member functions would be called.

Could you provide a smaller and clearer example to help me complete my plugins?

AastaLLL · June 18, 2019, 7:18am

Hi,

You can check our uff_ssd sample: /usr/src/tensorrt/samples/python/uff_ssd/plugin
It create a model from TensorFlow and link the plugin FlattenConcat layer to create TensorRT.

It should be similar to your use case.
Thanks.

posutsai · June 18, 2019, 7:23am

I can’t find the file under that path. Is it different from the SampleUffSSD sample?

AastaLLL · June 19, 2019, 2:10am

Hi,

Which JetPack version do you use?
Our plugin API has updated recently. It’s recommended to use the latest JetPack4.2 with TensorRT5.0.

Thanks.

posutsai · June 20, 2019, 4:21am

I use JetPack 4.1 with TensorRT5.0.
I found out the description here is very helpful.
[url]https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html#plugin_sample[/url]
However, the output of the plugin is not the same as I expected. I wonder if there is any debug tool to use since I can’t even print out any log in “enqueue” method.

posutsai · June 20, 2019, 4:56am

FYI, I paste my code in enqueue function here.

int enqueue(int batchSize, const void *const *inputs, void **outputs, void *workspace, cudaStream_t stream) {
	// Actual execution
	if (mDataType == DataType::kFLOAT) {
		printf("The operation is in 32 bit mode\n");
		float *dst = reinterpret_cast<float *>(outputs[0]);
		const float *src = reinterpret_cast<const float *>(inputs[0]);
		for (int r = 0; r < 300; r++, dst += 300) {
			CHECK(cudaMemcpy(dst, src, mInputDims.d[1] * sizeof(__half), cudaMemcpyDeviceToDevice));
		}
	} else {
		// plugin is set in FP16 mode
		__half *dst = reinterpret_cast<__half *>(outputs[0]);
		const __half *src = reinterpret_cast<const __half *>(inputs[0]);
		for (int r = 0; r < 300; r++, dst += 300) {
			CHECK(cudaMemcpy(dst, src, mInputDims.d[1] * sizeof(__half), cudaMemcpyDeviceToDevice));
		}
	}
}

In Python, this plugin is designed to make (1, 2048) tensor become (1, 2048 * 300) and replicate the original tensor to fill the rest space 299 times. Yet I found out there are several different places from the Python version.
First, when I print out inputDims in getOutputDimensions, it shows that the shape of the input tensor is (1, 1, 2048) and I don’t know why.
Second, the printf function or any method try to log out any info in enqueue method won’t work.

AastaLLL · June 25, 2019, 8:13am

Hi,

You can enable the log with this function:

class Logger : public nvinfer1::ILogger
{
public:
    Logger(Severity severity = Severity::kINFO)
        : reportableSeverity(severity)
    {
    }

    void log(Severity severity, const char* msg) override
    {
        // suppress messages with severity enum value greater than the reportable
        if (severity > reportableSeverity)
            return;

        switch (severity)
        {
        case Severity::kINTERNAL_ERROR: std::cerr << "INTERNAL_ERROR: "; break;
        case Severity::kERROR: std::cerr << "ERROR: "; break;
        case Severity::kWARNING: std::cerr << "WARNING: "; break;
        case Severity::kINFO: std::cerr << "INFO: "; break;
        default: std::cerr << "UNKNOWN: "; break;
        }
        std::cerr << msg << std::endl;
    }

    Severity reportableSeverity;
};
static Logger gLogger;
...
nvinfer1::createInferBuilder(gLogger);

For tensor data debugging, you can mark it as output and check the value directly.

Thanks.

posutsai · June 25, 2019, 11:35pm

@AastaLLL
Thanks for replying. Btw I wonder where would the enqueue function be executed? Is it Host or Device side?

int enqueue(int batchSize, const void *const *inputs, void **outputs, void *workspace, cudaStream_t stream) {
    if (mDataType == DataType::kFLOAT) {
          const float *src = reinterpret_cast<const float *>(inputs[0]);
          thrust::device_ptr<float> dst(reinterpret_cast<float *>(outputs[0]));
          for (int r = 0; r < mInputDims.d[2]; r++) {
              float el;
               CHECK(cudaMemcpy(&el, src,  sizeof(float), cudaMemcpyDeviceToDevice));
               thrust::fill(dst, dst+300, *src); // this line cause error.
          }
    } else {
        // .... code for FP16
    }
    return 0;
 }

After these days, I modify my code into the code segment as above. I switch the order of dimensions. Therefore, I need to fill a certain memory chunk with a float value. However, the error always occurs when I try to dereference the source pointer. I realize that it is illegal to dereference a device pointer in the Host side. Is there any suggestion to get the real float value to place in thrust::fill function?

AastaLLL · June 26, 2019, 3:01am

Hi,

The enqueue function is a CPU call to launch the inference kernel function on the GPU.
The buffer passed into enqueue is GPU memory.

So please make sure your buffer is CPU accessible before feeding it into the thrust.

General memory (ex. malloc, cudaMalloc): copy the memory back to CPU first.
Unified memory (ex. cudaMallocManaged): synchronize before accessing.

Here is sample for your reference:
https://github.com/dusty-nv/jetson-inference/blob/f0c6264b9408d6ff844da3a74d313956e6c7cb47/imageNet.cpp#L351

Thanks.

posutsai · June 26, 2019, 4:04am

I can see the enqueue function call is from the Host side. However, my goal is to prevent time waste on moving data from GPU to CPU and backward. If I use cudaMalloc to acquire memory, can I just deference the input buffer and assign the value into the allocated buffer?

posutsai · June 28, 2019, 3:54am

I think I figure out how to place the computation in GPU. I move the context into a global function and looks like I successfully move it to device side.

__global__ gpuMemTile(const void *const *inputs, void **outputs) {
    thrust::device_ptr<float> dst(reinterpret_cast<float *>(outputs[0]));
    const float *src = reinterpret_cast<const float *>(inputs[0]);
    for (int r = 0; r < 2048; r++)
        thrust::fill(dst, dst+300, *src);
}

int enqueue(int batchSize, const void *const *inputs, void **outputs, void *workspace, cudaStream_t stream) {
    if (mDataType == DataType::kFLOAT) {
        gpuMemTile<<<1, 16>>>(inputs, outputs);
    } else {
        // .... code for FP16
    }
    return 0;
 }

However, there are still several error during runtime and I couldn’t debug with cuda-dbg.

ERROR: engine.cpp (555) - Cuda Error in execute: 4
ERROR: engine.cpp (555) - Cuda Error in execute: 4

Are those error message related to model?

AastaLLL · June 28, 2019, 10:08am

Hi,

I need more source code to give a suggestion.
But maybe you can try to add a synchronize call before the thrust first.

Like:

int enqueue(int batchSize, const void *const *inputs, void **outputs, void *workspace, cudaStream_t stream) {
    if (mDataType == DataType::kFLOAT) {
        cudaDeviceSychronize();
        gpuMemTile<<<1, 16>>>(inputs, outputs);
    } else {
...

Thanks.

posutsai · July 2, 2019, 2:26am

I already send you my code to reproduce my error. Please let me know if you still lack some required info. Thanks for your help.

AastaLLL · July 2, 2019, 7:00am

Hi,

Thanks for your source code.

Something occurs to me today. There is an known issue of the Thrust in JetPack4.2.
The default Thrust in CUDA toolkit is compiled in the incorrect architecture.
So you will need to build it from the source. (The wrong one always generate kernel error…)

Have you re-compiled the library first? If not, would you mind to give it a try?
https://github.com/dmlc/xgboost/issues/4355

Thanks.

posutsai · July 2, 2019, 3:39pm

Thanks for your remind. I use JetPack4.1 and everything goes fine. Thanks for all your help again and I finally solve my issue.

AastaLLL · July 3, 2019, 2:09am

Good to know this!
Thanks for the update.

Topic		Replies	Views
getPluginCreator could not find plugin BatchedNMS_TRT version 1 TensorRT	5	4011	December 23, 2020
A simple example on the custom layer API (TensorRT 2.1)? Jetson TX2	28	6574	October 18, 2021
TensorRT ( C++ ) inference strange behavior on Jetson AGX Xavier TensorRT cudnn	0	18	January 15, 2025
enqueue error TensorRT	8	2306	December 3, 2019
The inference result based on transferred mode from tensorrt 4 with fcn8s is very terrible TensorRT	16	1730	October 12, 2021
A bus error had been occurred while I use addPlugin API in TensorRT TensorRT	1	831	January 18, 2019
How to use addPlugin( ) in tensorRT2.1? Jetson TX2	11	1869	October 18, 2021
Tensorflow 1.7 with TensorRT fails Jetson TX2	13	3821	October 18, 2021
Failed to convert engine to fp16, by using setFp16Mode TensorRT tensorrt	3	1088	October 20, 2020
Could not find any supported formats consistent with input/output data types TensorRT	1	854	April 11, 2023

Is there any clearer official doc of customizing plugins in TensorRT?

Related topics