A simple example of the custom layer API (TensorRT 2.1)?

Hello.

I am using TensorRT 2.1 and want to implement a simple custom layer. As practice, I wanted to make an “Inc” layer (one that just adds 1.0 to the input tensor’s values and keeps the dimensions the same).

I kept everything almost the same as the “class Reshape : public IPlugin” in sampleFasterRCNN.cpp, except for “getOutputDimensions()”, which I changed to keep the same dimensions. (This part seems fine.)

Where should I implement the “adding 1.0” part? I guess it should be in “enqueue()”, so I tried:

int enqueue(int batchSize, const void*const *inputs, void** outputs, void*, cudaStream_t stream) override
{
  // the line below is from the Reshape class; it copies the input to the output
  CHECK(cudaMemcpyAsync(outputs[0], inputs[0], mCopySize * batchSize, cudaMemcpyDeviceToDevice, stream));
  // add 1.0 to the first ten values
  float* foutputs = (float*) outputs[0];
  for (int i = 0; i < 10; i++) foutputs[i] += 1.0f;
  return 0;
}

However, this results in a segmentation fault.

My questions are:
1) Where and how can I implement a calculation between input and output (addition, in this case)?
2) Can you provide a simple example?

** Just in case, here is the full code of this example plugin (almost the same as the Reshape class).

class Inc : public IPlugin
{
public:
  Inc() {}
  Inc(const void* buffer, size_t size)
  {
    assert(size == sizeof(mCopySize));
    mCopySize = *reinterpret_cast<const size_t*>(buffer);
  }

  int getNbOutputs() const override
  {
    return 1;
  }
  Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) override
  {
    assert(nbInputDims == 1);
    assert(index == 0);
    assert(inputs[index].nbDims == 3);
    return DimsCHW(inputs[0].d[0], inputs[0].d[1], inputs[0].d[2]);  // same dimension
  }

  int initialize() override
  {
    return 0;
  }

  void terminate() override
  {
  }

  size_t getWorkspaceSize(int) const override
  {
    return 0;
  }

  // currently it is not possible for a plugin to execute "in place". Therefore we memcpy the data from the input to the output buffer
  int enqueue(int batchSize, const void*const *inputs, void** outputs, void*, cudaStream_t stream) override
  {
     CHECK(cudaMemcpyAsync(outputs[0], inputs[0], mCopySize * batchSize, cudaMemcpyDeviceToDevice, stream));
     float* foutputs = (float*) outputs[0];
     for (int i = 0; i < 10; i++) foutputs[i] += 1.0f;
     return 0;
  }

  size_t getSerializationSize() override
  {
    return sizeof(mCopySize);
  }

  void serialize(void* buffer) override
  {
    *reinterpret_cast<size_t*>(buffer) = mCopySize;
  }

  void configure(const Dims* inputs, int nbInputs, const Dims* outputs, int nbOutputs, int) override
  {
    mCopySize = inputs[0].d[0] * inputs[0].d[1] * inputs[0].d[2] * sizeof(float);
  }

protected:
  size_t mCopySize;
};

Hi,

First, you can try the Power layer, which is supported natively by TensorRT.
It computes y = (shift + scale * x)^power, so setting shift to 1.0 (and leaving scale and power at their default of 1.0) would add 1.0 to every value, just like your “Inc” layer. For example,

layer {
  name: "deploy_transform"
  type: "Power"
  bottom: "data"
  top: "transformed_data"
  power_param {
    shift: -127.0
  }
}

We often use this layer, with a negative shift as above, to subtract the image mean.

For plugin layer:
Yes, the computation is done in the enqueue() function.
But both the input and output buffers are in GPU memory, so you need to handle this with CUDA code.

Thanks.

Thanks. I didn’t know about the Power layer. Nevertheless, my minimal example was just for understanding the plugin layer. Could you provide an example of using CUDA code, or a link?

Hi,

Currently, we only have two samples that demonstrate the Plugin API (samplePlugin and sampleFasterRCNN),
but neither of them uses CUDA kernels, since they do not operate directly on the data values.

It should look like this:

__global__ void inc(const float* input, float* output, int maxidx)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < maxidx) output[i] = input[i] + 1.0f;
}

int enqueue(int batchSize, const void*const *inputs, void** outputs, void*, cudaStream_t stream) override
{
  // launch the kernel on the plugin's stream; the pointers are device memory
  inc<<<1, 10, 0, stream>>>((const float*)inputs[0], (float*)outputs[0], 10);
  return 0;
}
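
To cover the whole tensor instead of only the first ten values, you would launch enough threads for every element. A rough sketch (assuming the mCopySize member from your plugin above holds the per-image size in bytes):

int count = batchSize * mCopySize / sizeof(float);     // total number of floats
int threads = 256;
int blocks = (count + threads - 1) / threads;          // round up
inc<<<blocks, threads, 0, stream>>>((const float*)inputs[0], (float*)outputs[0], count);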

We are planning to write a more detailed sample to demonstrate the Plugin API.
Thanks for your feedback; it gives us more motivation to provide more Plugin examples.

Thanks so much! Yes, I was curious whether I could access and change the values. Your code looks very helpful. I will try it.

Hi AastaLLL,

I got compile errors: “error: ‘blockIdx’ was not declared in this scope” (and the same for blockDim and threadIdx).

I found a reference saying “This problem will occur when you are writing cuda code that is inside a file named .cpp, and you go to compile it. Rename the file to .cu.”

Is that right?

So, I added two files, core.cu and core.h.

core.cu:

#include <stdio.h>
#include "core.h"

__global__ void inc_cuda(const float* input, float *output, int maxidx)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < maxidx) output[i] = input[i] + 1;
}

void inc(int batchSize, const void*const *inputs, void** outputs, int maxidx)
{
	inc_cuda<<<1,maxidx>>>((const float*) inputs[0], (float*) outputs[0], maxidx);	
}

core.h:

void inc(int batchSize, const void*const *inputs, void** outputs, int maxidx);

In sampleFasterRCNN.cpp:

#include "core.h"
....
int enqueue(int batchSize, const void*const *inputs, void** outputs, void*, cudaStream_t stream) override
{
  inc(batchSize, inputs, outputs, 10);
  return 0;
}

By the way, I am not sure how to add the core.cu file to the existing Makefile.
Could you let me know how to modify the Makefile in this case?

Hi,

Yes, kernel code should be written in a .cu file.

For compilation, you can write a Makefile like this (note that each recipe line must start with a tab):

# Build tools
CXX = g++-5
NVCC = /usr/bin/nvcc

# here are all the objects
GPUOBJS = cuexample.o
OBJS = cppexample.o

# make and compile
cudaexample.out: $(OBJS) $(GPUOBJS)
	$(NVCC) -o cudaexample.out $(OBJS) $(GPUOBJS)

cuexample.o: cuexample.cu
	$(NVCC) -c cuexample.cu

cppexample.o: cppexample.cpp
	$(CXX) -c cppexample.cpp

clean:
	rm cppexample.o cuexample.o

Hi,

We have written a face-recognition sample to demonstrate the TensorRT 2.1 Plugin API.
Please check the GitHub repository for more details:

Hi,

Thanks for your sample code (Face-Recognition).
However, I am still not sure how to port my Caffe layer code, written in C, to a Plugin. Could you please provide the corresponding Caffe layer code written in C so I can understand the mapping more clearly? Thanks!

Hi,

Have you tried our jetson_inference sample?
TensorRT can import a caffemodel directly.
You can check this tutorial for details.
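
In case it is useful, here is a rough sketch of importing a caffemodel together with a plugin factory, from memory of the TensorRT 2.x API as used in sampleFasterRCNN (gLogger, pluginFactory, and the file names are placeholders you would supply yourself):

#include "NvInfer.h"
#include "NvCaffeParser.h"
using namespace nvinfer1;
using namespace nvcaffeparser1;

// gLogger is an ILogger implementation; pluginFactory implements the Caffe
// parser's IPluginFactory so that unsupported layers map to your IPlugin classes
IBuilder* builder = createInferBuilder(gLogger);
INetworkDefinition* network = builder->createNetwork();
ICaffeParser* parser = createCaffeParser();
parser->setPluginFactory(&pluginFactory);
parser->parse("deploy.prototxt", "net.caffemodel", *network, DataType::kFLOAT);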

If you are interested in the Face-Recognition sample, you can change the input of TensorRT to your own model.

Thanks.

Hi,

Thanks for your reply.

I want to add my own custom layer, so I am referencing the sample code in Face-Recognition,

but I am curious about some things inside “BboxMergeLayer::BboxMergeLayer” in pluginImplement.cpp,

e.g.

dimsData = DimsCHW{d[0], d[1], d[2]};
dimsConf = DimsCHW{d[3], d[4], d[5]};
dimsBbox = DimsCHW{d[6], d[7], d[8]};

I don’t know where to find the details.

Thanks!

Hi,

You can check the TensorRT documentation for details.

It is located at /usr/share/doc/tensorrt/.
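
In case it helps while you read: those three lines simply rebuild the tensor shapes from the plugin’s serialized buffer, and DimsCHW is just a small TensorRT struct holding channels, height, and width. A rough sketch of what is happening (variable names are only illustrative):

// d points into the serialized plugin buffer; each group of three ints is one shape
nvinfer1::DimsCHW dimsData{d[0], d[1], d[2]};   // data tensor (C, H, W)
nvinfer1::DimsCHW dimsConf{d[3], d[4], d[5]};   // confidence tensor
nvinfer1::DimsCHW dimsBbox{d[6], d[7], d[8]};   // bounding-box tensor

int c = dimsData.c();   // accessors for each dimension
int h = dimsData.h();
int w = dimsData.w();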

Hi,

In the faster-rcnn sample,

I added

float* data = (float*)inputs[0];
std::cout << "data:" << data[0] << std::endl;

in the enqueue() of Reshape,

but it crashes (core dumped).

Why does this error happen? Thanks.

Hi,

inputs[0] is a GPU memory pointer and cannot be accessed from the CPU.
Please copy the memory back to the CPU first.

Thanks.

Hi,

ok~

The enqueue method is based on pluginImplement.cpp from the “face-recognition” sample,

but how do I copy the memory back to the CPU?

Where can I find that in the “face-recognition” sample?

Hi,

Use cudaMemcpy(data, d_data, size, cudaMemcpyDeviceToHost) to copy the memory back to the host.
Here is an example for your reference:
https://devblogs.nvidia.com/parallelforall/easy-introduction-cuda-c-and-c/
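
For your reshape debugging case, a minimal sketch inside enqueue() could look like this (it assumes <cuda_runtime.h> and <iostream> are included; the synchronize makes sure earlier work queued on the stream has finished):

float hostData[10];
cudaStreamSynchronize(stream);   // wait for pending work on this stream
cudaMemcpy(hostData, inputs[0], sizeof(hostData), cudaMemcpyDeviceToHost);
for (int i = 0; i < 10; i++)
    std::cout << "data[" << i << "]: " << hostData[i] << std::endl;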

Face-recognition can be found here:

Hi,

But then why, in the face-recognition sample’s DataRoiLayer::enqueue,

========================================
float* bbox = (float*)inputs[1];
int roi[4] = { int(bbox[0]+0.5), int(bbox[1]+0.5), int(bbox[2]+0.5), int(bbox[3]+0.5) }; // rounding

can the input data be accessed directly?

Hi,

Face-recognition uses unified memory, which can be accessed from both the CPU and the GPU.
The faster-rcnn sample uses standard CUDA device memory, which can only be accessed from the GPU.

Here is the documentation on unified memory:
Programming Guide :: CUDA Toolkit Documentation
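
As a rough standalone illustration of the difference (this is not code from either sample):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int n = 10;

    // standard device memory: only the GPU may dereference this pointer
    float* dData = nullptr;
    cudaMalloc(&dData, n * sizeof(float));
    // dData[0] = 1.0f;                  // would crash if done on the CPU

    // unified (managed) memory: both CPU and GPU may access it
    float* uData = nullptr;
    cudaMallocManaged(&uData, n * sizeof(float));
    uData[0] = 1.0f;                     // fine on the CPU
    printf("uData[0] = %f\n", uData[0]);

    cudaFree(dData);
    cudaFree(uData);
    return 0;
}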

Hi,

The “reshape” layer is implemented as an IPlugin, so does TensorRT support FP16 and INT8 for this layer?

My reshape layer is at the end of the net; will it work if I use FP16 or INT8?

Hi,

Currently, the Plugin API only supports FP32 mode.
We will enable FP16 mode in a future release, but there is no concrete schedule yet.

Thanks.