Is there any clearer official doc of customizing plugins in TensorRT?

The followings are the source I have already check.

  1. Official sample SampleUffSSD and SamplePlugin
  2. TensorRT Doc https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/c_api/_nv_infer_8h_source.html
  3. Nvidia Developer Guide https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#ipluginext

In my case, I am trying to implement tile and sign operations. Those two operation functions are just like in Tensorflow.

I know that I should implement two classes that inherit from IPluginV2 and IPluginCreator respectively and there is a list of virtual function to be implemented according to interfaces. However, the only clue I know what to fill in is the function name. For example, I would like to know where I should do the computation and when the member functions would be called.

Could you provide a smaller and clearer example to help me complete my plugins?

Hi,

You can check our uff_ssd sample: /usr/src/tensorrt/samples/python/uff_ssd/plugin
It create a model from TensorFlow and link the plugin FlattenConcat layer to create TensorRT.

It should be similar to your use case.
Thanks.

I can’t find the file under that path. Is it different from the SampleUffSSD sample?

Hi,

Which JetPack version do you use?
Our plugin API has updated recently. It’s recommended to use the latest JetPack4.2 with TensorRT5.0.

Thanks.

I use JetPack 4.1 with TensorRT5.0.
I found out the description here is very helpful.
https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html#plugin_sample
However, the output of the plugin is not the same as I expected. I wonder if there is any debug tool to use since I can’t even print out any log in “enqueue” method.

FYI, I paste my code in enqueue function here.

int enqueue(int batchSize, const void *const *inputs, void **outputs, void *workspace, cudaStream_t stream) {
	// Actual execution
	if (mDataType == DataType::kFLOAT) {
		printf("The operation is in 32 bit mode\n");
		float *dst = reinterpret_cast<float *>(outputs[0]);
		const float *src = reinterpret_cast<const float *>(inputs[0]);
		for (int r = 0; r < 300; r++, dst += 300) {
			CHECK(cudaMemcpy(dst, src, mInputDims.d[1] * sizeof(__half), cudaMemcpyDeviceToDevice));
		}
	} else {
		// plugin is set in FP16 mode
		__half *dst = reinterpret_cast<__half *>(outputs[0]);
		const __half *src = reinterpret_cast<const __half *>(inputs[0]);
		for (int r = 0; r < 300; r++, dst += 300) {
			CHECK(cudaMemcpy(dst, src, mInputDims.d[1] * sizeof(__half), cudaMemcpyDeviceToDevice));
		}
	}
}

In Python, this plugin is designed to make (1, 2048) tensor become (1, 2048 * 300) and replicate the original tensor to fill the rest space 299 times. Yet I found out there are several different places from the Python version.
First, when I print out inputDims in getOutputDimensions, it shows that the shape of the input tensor is (1, 1, 2048) and I don’t know why.
Second, the printf function or any method try to log out any info in enqueue method won’t work.

Hi,

You can enable the log with this function:

class Logger : public nvinfer1::ILogger
{
public:
    Logger(Severity severity = Severity::kINFO)
        : reportableSeverity(severity)
    {
    }

    void log(Severity severity, const char* msg) override
    {
        // suppress messages with severity enum value greater than the reportable
        if (severity > reportableSeverity)
            return;

        switch (severity)
        {
        case Severity::kINTERNAL_ERROR: std::cerr << "INTERNAL_ERROR: "; break;
        case Severity::kERROR: std::cerr << "ERROR: "; break;
        case Severity::kWARNING: std::cerr << "WARNING: "; break;
        case Severity::kINFO: std::cerr << "INFO: "; break;
        default: std::cerr << "UNKNOWN: "; break;
        }
        std::cerr << msg << std::endl;
    }

    Severity reportableSeverity;
};
static Logger gLogger;
...
nvinfer1::createInferBuilder(gLogger);

For tensor data debugging, you can mark it as output and check the value directly.

Thanks.

@AastaLLL
Thanks for replying. Btw I wonder where would the enqueue function be executed? Is it Host or Device side?

int enqueue(int batchSize, const void *const *inputs, void **outputs, void *workspace, cudaStream_t stream) {
    if (mDataType == DataType::kFLOAT) {
          const float *src = reinterpret_cast<const float *>(inputs[0]);
          thrust::device_ptr<float> dst(reinterpret_cast<float *>(outputs[0]));
          for (int r = 0; r < mInputDims.d[2]; r++) {
              float el;
               CHECK(cudaMemcpy(&el, src,  sizeof(float), cudaMemcpyDeviceToDevice));
               thrust::fill(dst, dst+300, *src); // this line cause error.
          }
    } else {
        // .... code for FP16
    }
    return 0;
 }

After these days, I modify my code into the code segment as above. I switch the order of dimensions. Therefore, I need to fill a certain memory chunk with a float value. However, the error always occurs when I try to dereference the source pointer. I realize that it is illegal to dereference a device pointer in the Host side. Is there any suggestion to get the real float value to place in thrust::fill function?

Hi,

The enqueue function is a CPU call to launch the inference kernel function on the GPU.
The buffer passed into enqueue is GPU memory.

So please make sure your buffer is CPU accessible before feeding it into the thrust.

  • General memory (ex. malloc, cudaMalloc): copy the memory back to CPU first.
  • Unified memory (ex. cudaMallocManaged): synchronize before accessing.

Here is sample for your reference:
https://github.com/dusty-nv/jetson-inference/blob/f0c6264b9408d6ff844da3a74d313956e6c7cb47/imageNet.cpp#L351

Thanks.

I can see the enqueue function call is from the Host side. However, my goal is to prevent time waste on moving data from GPU to CPU and backward. If I use cudaMalloc to acquire memory, can I just deference the input buffer and assign the value into the allocated buffer?

I think I figure out how to place the computation in GPU. I move the context into a global function and looks like I successfully move it to device side.

__global__ gpuMemTile(const void *const *inputs, void **outputs) {
    thrust::device_ptr<float> dst(reinterpret_cast<float *>(outputs[0]));
    const float *src = reinterpret_cast<const float *>(inputs[0]);
    for (int r = 0; r < 2048; r++)
        thrust::fill(dst, dst+300, *src);
}

int enqueue(int batchSize, const void *const *inputs, void **outputs, void *workspace, cudaStream_t stream) {
    if (mDataType == DataType::kFLOAT) {
        gpuMemTile<<<1, 16>>>(inputs, outputs);
    } else {
        // .... code for FP16
    }
    return 0;
 }

However, there are still several error during runtime and I couldn’t debug with cuda-dbg.

ERROR: engine.cpp (555) - Cuda Error in execute: 4
ERROR: engine.cpp (555) - Cuda Error in execute: 4

Are those error message related to model?

Hi,

I need more source code to give a suggestion.
But maybe you can try to add a synchronize call before the thrust first.

Like:

int enqueue(int batchSize, const void *const *inputs, void **outputs, void *workspace, cudaStream_t stream) {
    if (mDataType == DataType::kFLOAT) {
        cudaDeviceSychronize();
        gpuMemTile<<<1, 16>>>(inputs, outputs);
    } else {
...

Thanks.

I already send you my code to reproduce my error. Please let me know if you still lack some required info. Thanks for your help.

Hi,

Thanks for your source code.

Something occurs to me today. There is an known issue of the Thrust in JetPack4.2.
The default Thrust in CUDA toolkit is compiled in the incorrect architecture.
So you will need to build it from the source. (The wrong one always generate kernel error…)

Have you re-compiled the library first? If not, would you mind to give it a try?
https://github.com/dmlc/xgboost/issues/4355

Thanks.

Thanks for your remind. I use JetPack4.1 and everything goes fine. Thanks for all your help again and I finally solve my issue.

Good to know this!
Thanks for the update.