Cuda Memory Error when enabling the DLA

Hey All,

When I try to build a trt plan with a max batchsize greater than 2 I get no errors but when I try to run it I get this one

dla/nvmRegion.cpp:474: nvinfer1::rt::dla::NvmItem::~NvmItem(): Assertion `!mCudaMemory || !mNvmTensor' failed.

Here’s my code for enabling the dla, which is pretty standard.

builder->setMaxBatchSize(maxBatchSize);
  builder->setMaxWorkspaceSize(maxWorkspaceSize);

  builder->setDefaultDeviceType(DeviceType::kDLA);
  builder->setDLACore(1);
  builder->allowGPUFallback(true);

I also get a bunch of these errors earlier on in my execution program

dla/dlaUtils.cpp (536) - DLA Error in submit: 7 (Failure to submit program to DLA engine.)
dla/dlaUtils.cpp (536) - DLA Error in submit: 7 (Failure to submit program to DLA engine.)
NVMEDIA_DLA : 1098, ERROR: setInputTensorDesc failed 
NVMEDIA_DLA : 1194, ERROR: SetInputTensorDesc failed for tensor: 7. status: 0x0.

which makes it look like I can’t submit my input tensor to the DLA, thoughts on why this is happening? and only for Batchsize > 2?

Hi,

May I know which TensorRT version do you use first?
Thanks.

Hey Aasta,

its TensorRT 5.0.3.2 with Cuda-10.0

Hi,

There is a similar bug and is fixed last year.
Would you mind to reflash your device with JetPack4.2, which will give you TensorRT5.0.6.3.

Thanks.

Hey Aasta

I just reflashed Friday, so I’ll give it a gander and let you know if that solved it.

Thanks

Hey Aasta,

I just checked and I can confirm that the problem still exists with JetPack4.2 and TensorRT 5.0.6.3. Thoughts on how to fix this then?

Additional Info: I am running both the DLA and GPU together, as GPU fallback is on and the DLA doesn’t work on all the layers, Current Model is Vgg_16 from the TensorFlow slim model zoo, however I have noticed this problem on Inception_v1, and Resnet_v1_50 both also from the TensorFlow slim model zoo. It only occurs when trying to trying to run a TensorRT plan with a max batchsize > 1, regardless if the batchsize during runtime is greater than 1 or not.

Hi,

Do you have a sample can help us reproduce this issue?
We want to forward this to our internal team to get some further suggestion.

Thanks.

Hey Aasta,

Ive been using this github repo: https://github.com/NVIDIA-AI-IOT/tf_to_trt_image_classification with edits made to the uff_to_plan.cpp file as below:

#include <iostream>
#include <string>
#include <sstream>
#include <fstream>

#include <NvInfer.h>
#include <NvUffParser.h>

using namespace std;
using namespace nvinfer1;
using namespace nvuffparser;

class Logger : public ILogger
{
  void log(Severity severity, const char * msg) override
  {
      cout << msg << endl;
  }
} gLogger;

int toInteger(string value)
{
  int valueInteger;
  stringstream ss;
  ss << value;
  ss >> valueInteger;
  return valueInteger;
}

DataType toDataType(string value)
{
  if (value == "float")
    return DataType::kFLOAT;
  else if (value == "half")
    return DataType::kHALF;
  else
    throw runtime_error("Unsupported data type");
}

int main(int argc, char *argv[])
{
  if (argc != 10)
  {
    cout << "Usage: <uff_filename> <plan_filename> <input_name> <input_height> <input_width>"
      << " <output_name> <max_batch_size> <max_workspace_size> <data_type>\n";
    return 1;
  }

  /* parse command line arguments */
  string uffFilename = argv[1];
  string planFilename = argv[2];
  string inputName = argv[3];
  int inputHeight = toInteger(argv[4]);
  int inputWidth = toInteger(argv[5]);
  string outputName = argv[6];
  int maxBatchSize = toInteger(argv[7]);
  int maxWorkspaceSize = toInteger(argv[8]);
  DataType dataType = toDataType(argv[9]);

  /* parse uff */
  IBuilder *builder = createInferBuilder(gLogger);
  INetworkDefinition *network = builder->createNetwork();
  IUffParser *parser = createUffParser();
  parser->registerInput(inputName.c_str(), DimsCHW(3, inputHeight, inputWidth), UffInputOrder::kNCHW);
  parser->registerOutput(outputName.c_str());
  if (!parser->parse(uffFilename.c_str(), *network, dataType))
  {
    cout << "Failed to parse UFF\n";
    builder->destroy();
    parser->destroy();
    network->destroy();
    return 1;
  }

  /* build engine */
  if (dataType == DataType::kHALF)
    builder->setHalf2Mode(true);

  builder->setMaxBatchSize(maxBatchSize);
  builder->setMaxWorkspaceSize(maxWorkspaceSize);
  
  builder->setDefaultDeviceType(DeviceType::kDLA);
  builder->setDLACore(0);
  builder->setFp16Mode(true);
  builder->allowGPUFallback(true);
  
  ICudaEngine *engine = builder->buildCudaEngine(*network);

/* serialize engine and write to file */
  ofstream planFile;
  planFile.open(planFilename);
  IHostMemory *serializedEngine = engine->serialize();
  planFile.write((char *)serializedEngine->data(), serializedEngine->size());
  planFile.close(); 
  
  /* break down */
  builder->destroy();
  parser->destroy();
  network->destroy();
  engine->destroy();
  serializedEngine->destroy();

  return 0;
}

And edits made to the test_trt.cu file as below:

/**
 * Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
 * Full license terms provided in LICENSE.md file.
 */

#include <iostream>
#include <string>
#include <vector>
#include <sstream>
#include <chrono>
#include <stdexcept>
#include <fstream>

#include <cuda_runtime.h>
#include <opencv2/opencv.hpp>
#include <NvInfer.h>

#define MS_PER_SEC 1000.0

using namespace std;
using namespace nvinfer1;

class TestConfig;

typedef void (*preprocess_fn_t)(float *input, size_t channels, size_t height, size_t width, int batchsize);
void preprocessVgg(float *input, size_t channels, size_t height, size_t width, int batchsize);
void preprocessInception(float *input, size_t channels, size_t height, size_t width, int batchsize);
size_t argmax(float *input, size_t numel, int batchsize);
void test(const TestConfig &testConfig);

class TestConfig
{
public:
  string imagePath;
  string planPath;
  string inputNodeName;
  string outputNodeName;
  string preprocessFnName;
  string inputHeight;
  string inputWidth;
  string numOutputCategories;
  string dataType;
  string maxBatchSize;
  string workspaceSize;
  string numRuns;
  string useMappedMemory;
  string statsPath;

  TestConfig(int argc, char * argv[])
  {
    imagePath = argv[1];
    planPath = argv[2];
    inputNodeName = argv[3];
    inputHeight = argv[4];
    inputWidth = argv[5];
    outputNodeName = argv[6];
    numOutputCategories = argv[7];
    preprocessFnName = argv[8];
    numRuns = argv[9];
    dataType = argv[10];
    maxBatchSize = argv[11];
    workspaceSize = argv[12];
    useMappedMemory = argv[13];
    statsPath = argv[14];
	//imagePath = "/home/nrwv46/ml_sandbox/data/newfie0.JPEG";
	
  }

  static string UsageString()
  {
    string s = "";
    s += "imagePath: \n";
    s += "planPath: \n";
    s += "inputNodeName: \n";
    s += "inputHeight: \n";
    s += "inputWidth: \n";
    s += "outputNodeName: \n";
    s += "numOutputCategories: \n";
    s += "preprocessFnName: \n";
    s += "numRuns: \n";
    s += "dataType: \n";
    s += "maxBatchSize: \n";
    s += "workspaceSize: \n";
    s += "useMappedMemory: \n";
    s += "statsPath: \n";
    return s;

  }

  string ToString()
  {
    string s = "";
    s += "imagePath: " + imagePath + "\n";
    s += "planPath: " + planPath + "\n";
    s += "inputNodeName: " + inputNodeName + "\n";
    s += "inputHeight: " + inputHeight + "\n";
    s += "inputWidth: " + inputWidth + "\n";
    s += "outputNodeName: " + outputNodeName + "\n";
    s += "numOutputCategories: " + numOutputCategories + "\n";
    s += "preprocessFnName: " + preprocessFnName + "\n";
    s += "numRuns: " + numRuns + "\n";
    s += "dataType: " + dataType + "\n";
    s += "maxBatchSize: " + maxBatchSize + "\n";
    s += "workspaceSize: " + workspaceSize + "\n";
    s += "useMappedMemory: " + useMappedMemory + "\n";
    s += "statsPath: " + statsPath + "\n";
    return s;
  }

  static int ToInteger(string value)
  {
    int valueInt;
    stringstream ss;
    ss << value;
    ss >> valueInt;
    return valueInt;
  }

  preprocess_fn_t PreprocessFn() const {
    if (preprocessFnName == "preprocess_vgg")
       return preprocessVgg;
    else if (preprocessFnName == "preprocess_inception")
       return preprocessInception;
    else
       throw runtime_error("Invalid preprocessing function name.");
  }

  int InputWidth() const { return ToInteger(inputWidth); }
  int InputHeight() const { return ToInteger(inputHeight); }
  int NumOutputCategories() const { return ToInteger(numOutputCategories); }

  nvinfer1::DataType DataType() const {
    if (dataType == "float")
      return nvinfer1::DataType::kFLOAT;
    else if (dataType == "half")
      return nvinfer1::DataType::kHALF;
    else
      throw runtime_error("Invalid data type.");
  }

  int MaxBatchSize() const { return ToInteger(maxBatchSize); }
  int WorkspaceSize() const { return ToInteger(workspaceSize); }
  int NumRuns() const { return ToInteger(numRuns); }
  int UseMappedMemory() const { return ToInteger(useMappedMemory); }
};

class Logger : public ILogger
{
  void log(Severity severity, const char * msg) override
  {
      cout << msg << endl;
  }
} gLogger;

int main(int argc, char * argv[])
{

  //if (argc != 15)
  //{
  //  cout << TestConfig::UsageString() << endl;
  //  return 0;
 // }

  TestConfig testConfig(argc, argv);
  cout << "\ntestConfig: \n" << testConfig.ToString() << endl;

  test(testConfig);

  return 0;
}

void preprocessVgg(float * tensor, size_t channels, size_t height, size_t width, int batchsize)
{
  const size_t strides[3] = { height * width, width, 1 };
  const float mean[3] = { 123.68, 116.78, 103.94 };

 for (int h  = 1; h < batchsize + 1; h++) {
      for (int i = 0; i < height; i++)
      {
        for (int j = 0; j < width; j++)
        {
          for (int k = 0; k < channels; k++)
          {
            const size_t offset = h * (k * strides[0] + i * strides[1] + j * strides[2]);
            tensor[offset] -= mean[k];
          }
        }
      }
  }
}

void preprocessInception(float * tensor, size_t channels, size_t height, size_t width, int batchsize)
{
  const size_t numel = channels * height * width;
  for (int i = 0; i < numel; i++)
    tensor[i] = 2.0 * (tensor[i] / 255.0 - 0.5);
}

size_t argmax(float * tensor, size_t numel, int batchsize)
{

if (numel <= 0)
    return 0;
  size_t maxIndex = 0;
  float max = tensor[0];
  for (int j = 1; j < batchsize+1; j++) {
	  vector <float> top_five;
	  for (int i = 0 + ((j-1)* numel) ; i < j * numel; i++)
	  {
		top_five.push_back(tensor[i]);

		if (tensor[i] > max)
		{
		  maxIndex = i;
		  max = tensor[i];
		}
		cout << "val: " << tensor[i] <<" : " <<i << endl;
	  }

	  sort(top_five.begin(), top_five.end());
	  reverse(top_five.begin(), top_five.end());
	  cout << "\nClass value for Batch: " <<j << " is: " << maxIndex << " : " << max << endl;
	  cout << "		Top five: ";
	  for(int b=0; b < 5; b++) {
		  cout << top_five[b]<< " ";
	  }
	  maxIndex = 0;
	  max = tensor[numel];
  }

  return maxIndex;
}

void test(const TestConfig &testConfig)
{
  ifstream planFile(testConfig.planPath);
  stringstream planBuffer;
  planBuffer << planFile.rdbuf();
  string plan = planBuffer.str();
  IRuntime *runtime = createInferRuntime(gLogger);
  ICudaEngine *engine = runtime->deserializeCudaEngine((void*)plan.data(),
      plan.size(), nullptr);
  IExecutionContext *context = engine->createExecutionContext();

  int inputBindingIndex, outputBindingIndex;
  inputBindingIndex = engine->getBindingIndex(testConfig.inputNodeName.c_str());
  outputBindingIndex = engine->getBindingIndex(testConfig.outputNodeName.c_str());

  int batchsize = stoi(testConfig.maxBatchSize, nullptr);

  vector<cv::Mat> image_array;

  // load and preprocess image
  for(int i =0; i < batchsize; i++)
  {

      cv::Mat image = cv::imread(testConfig.imagePath, CV_LOAD_IMAGE_COLOR);
      cv::cvtColor(image, image, cv::COLOR_BGR2RGB, 3);
      cv::resize(image, image, cv::Size(testConfig.InputWidth(), testConfig.InputHeight()));
      image_array.push_back(image);
  }

  const float mean[3] = { 123.68, 116.78, 103.94 };
  const size_t height = image_array[0].rows;
  const size_t width = image_array[0].cols;
  const size_t channels = image_array[0].channels();
  float input[height * width * channels * batchsize];
  for (int i = 0, volImg = channels * height * width; i < batchsize; ++i)
  {
     for (int c = 0; c < channels; ++c)
     {
        // the color image to input should be in BGR order
        for (unsigned j = 0, volChl = height * width; j < volChl; ++j){
           input[i * volImg + c * volChl + j] = float(image_array[i].data[j * channels] - mean[c]);
        }
     }
  }

  // allocate memory on host / device for input / output
  float *output;
  float *inputDevice;
  float *outputDevice;
  size_t inputSize = batchsize * testConfig.InputHeight() * testConfig.InputWidth() * 3 * sizeof(float);

  // need to multiply it by batch size below
  cudaHostAlloc(&output, batchsize * testConfig.NumOutputCategories() * sizeof(float), cudaHostAllocMapped);

  if (testConfig.UseMappedMemory())
  {
    cudaHostGetDevicePointer(&inputDevice, input, 0);
    cudaHostGetDevicePointer(&outputDevice, output, 0);
  }
  else
  {
    cudaMalloc(&inputDevice, inputSize);
    // need to multiply it by batch size below
    cudaMalloc(&outputDevice, batchsize * testConfig.NumOutputCategories() * sizeof(float));
  }

float *bindings[2];
  bindings[inputBindingIndex] = inputDevice;
  bindings[outputBindingIndex] = outputDevice;

  // run and compute average time over numRuns iterations
  double avgTime_in = 0;
  double avgTime_exec = 0;
  double avgTime_out = 0;

  for (int i = 0; i < testConfig.NumRuns() + 1; i++)
  {

    chrono::duration<double> exec_diff;
    chrono::duration<double> in_diff;
    chrono::duration<double> out_diff;

    if (testConfig.UseMappedMemory())
    {

      auto t0 = chrono::steady_clock::now();
      context->execute(1, (void**)bindings);
      auto t1 = chrono::steady_clock::now();
      exec_diff = t1 - t0;
    }
    else
    {

      auto t0 = chrono::steady_clock::now();

cudaMemcpy(inputDevice, input, inputSize, cudaMemcpyHostToDevice); //seg fault here

      auto t1 = chrono::steady_clock::now();

      //change the 1 to batchsize

      context->execute(batchsize, (void**)bindings);
      auto t2 = chrono::steady_clock::now();
      // need to multiply it by batch size below
      cudaMemcpy(output, outputDevice, batchsize * testConfig.NumOutputCategories() * sizeof(float), cudaMemcpyDeviceToHost);
      auto t3 = chrono::steady_clock::now();

      in_diff = t1 - t0;
      exec_diff = t2 - t1;
      out_diff = t3 - t2;
    }

if (i != 0)
      avgTime_exec += MS_PER_SEC * exec_diff.count();
      avgTime_in += MS_PER_SEC * in_diff.count();
      avgTime_out += MS_PER_SEC * out_diff.count();
  }
  avgTime_in /= testConfig.NumRuns();
  avgTime_exec /= (testConfig.NumRuns() * batchsize);
  avgTime_out /= testConfig.NumRuns();

  // save results to file
  int maxCategoryIndex = argmax(output, testConfig.NumOutputCategories(), batchsize) + 1001 - testConfig.NumOutputCategories();
  //cout << "\nMost likely category id is " << maxCategoryIndex << endl;
  cout << "\nAverage Input Loading time in ms is " << avgTime_in << endl;
  cout << "Average execution time/per image in ms is " << avgTime_exec << endl;
  cout << "Average execution time total in ms is " << avgTime_exec * batchsize << endl;
  cout << "Average Output Loading time in ms is " << avgTime_out << endl;
  ofstream outfile;
  outfile.open(testConfig.statsPath, ios_base::app);
  outfile << "\n" << testConfig.planPath
    << " " << avgTime_exec;
    // << " " << maxCategoryIndex
    // << " " << testConfig.InputWidth()
    // << " " << testConfig.InputHeight()
    // << " " << testConfig.MaxBatchSize()
    // << " " << testConfig.WorkspaceSize()
    // << " " << testConfig.dataType
    // << " " << testConfig.NumRuns()
    // << " " << testConfig.UseMappedMemory();
  outfile.close();

  cudaFree(inputDevice);
  cudaFree(outputDevice);

  cudaFreeHost(input);
  cudaFreeHost(output);

  engine->destroy();
  context->destroy();
  runtime->destroy();
}

Other than that I pretty much use the same method described in the Github’s test_trt.py section and in the convert_plan.py section. The models are from the github’s download link as well.

Hi,

Is there the same error with the default tf_to_trt_image_classification without the edit you made?
Thanks.

Hey Aasta,

I can confirm that the error still exists without the edits I made, which logically follows as the only edits I made were to turn on the DLA in uff_to_plan.cpp and reformat the data entry process in test_trt.cu to allow for batches.

They are the same errors verbatim.

Thanks

Hi,

Suppose you are using our official TensorFlow package from here:
https://devtalk.nvidia.com/default/topic/1042125/jetson-agx-xavier/official-tensorflow-for-jetson-agx-xavier/

We will reproduce this issue and pass it to our internal team.
Thanks.

Hey Aasta,

yes I am using that package, I believe that is the only form of tensorflow that works on the Xavier.
Let me know of any updates.

Thanks.

Hi,

This is a memory related issue.

Currently, we limit per DLA core to 2MB upto a total of 8 subgraphs per network.
This will lead to submission fail with a complicated model.

Would you mind to give googlenet a try first?

Thanks.

Hey Aasta,

I originally tried inception_v1/googlenet but had the same errors as resnet and vgg_16. Unfortunately I do not have a stored inception plan saved for the Xavier and it appears that upgrading to Jetpack 4.2 and TensorRT 5.0.6.3 broke the uff parser, preventing me from creating a new one now. I’ll flash the Xavier back to JetPack 4.1 and TensorRT 5.0.3 and get a plan, save it externally and reflash to 4.2/5.0.6.3 and report my results

Thanks,
Sam

I’m facing a similar issue when trying to run YOLO to the DLA. Is there a workaround for this yet? This is on the 4.2/5.0.6.1 setup. I’m able to get it to work with a max batch size of 1 but unable to make it work on higher max batch sizes.

Hi,

Since we already release a newer package, would you mind to try our JetPack4.2.1/TensorRT5.1 first?
Thanks.

Hello AastaLLL,

I still get this error on the new JetPack version (4.2.1):

NVMEDIA_DLA :  717, ERROR: setInputTensorDesc failed 
NVMEDIA_DLA :  801, ERROR: SetInputTensorDesc failed for tensor: 7. status: 0x0.
NVMEDIA_DLA :  967, ERROR: BindArgs failed (Input). status: 0x7.
Segmentation fault (core dumped)

I am trying to do INT 8 inference with DLA with batch size 1 based on the Retinanet model.
What could be the cause of this?

Thanks!

Redirect to topic 1058137:
https://devtalk.nvidia.com/default/topic/1058371/jetson-agx-xavier/int8-dla-error/