Unable to use GPU while training

As you can see, GPU usage is almost 0. I followed all the steps in the manual and did off-line training, here is the link:

Does that have anything to do with the gpu_visible_device parameter? If it does, how should I set it?

Steps I have been through:

  1. Unity Simulator: ./sample.x86_64 --scene pose_estimation_cnn_training
  2. Isaac SDK, generate data for off-line training:
    bazel run packages/object_pose_estimation/apps/pose_cnn_decoder/training:generate_training_validation_data
  3. Isaac SDK, off-line training:
    bazel run packages/object_pose_estimation/apps/pose_cnn_decoder/training:pose_estimation_cnn_training

Note: I use “./sample.x86_64 --scene pose_estimation_cnn_training” instead of Factory01 because Factory01 crashes all the time. How can I use Factory01 for CNN training?

So you’re seeing almost no GPU utilization during training, and the nvidia-smi snapshot was taken while step 3 was running? Are you running all of this in a container or on bare metal, by the way? Is there any console output from the training, such as warnings? GPU underutilization can be a sign that disk I/O is too slow, or that something else in the pipeline is not delivering tensors quickly enough to keep CUDA busy.
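To tell a stalled input pipeline from a GPU that is not being used at all, it can help to sample utilization repeatedly while step 3 runs rather than rely on one snapshot. A minimal sketch, assuming nvidia-smi is on the PATH; the helper names parse_gpu_stats and sample_gpu_stats are made up for illustration, while the query flags are the standard nvidia-smi ones:

```python
import shutil
import subprocess

# Standard nvidia-smi query: GPU utilization (%) and memory in use (MiB), one line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def _to_int(field):
    """Convert a CSV field to int, or None when the driver reports e.g. '[N/A]'."""
    try:
        return int(field.strip())
    except ValueError:
        return None


def parse_gpu_stats(output):
    """Parse 'util, mem' lines into a list of (util_percent, mem_mib) tuples."""
    stats = []
    for line in output.strip().splitlines():
        util, mem = line.split(",")
        stats.append((_to_int(util), _to_int(mem)))
    return stats


def sample_gpu_stats():
    """Return current stats for every GPU, or None if nvidia-smi is not installed."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

Calling sample_gpu_stats() every second or so during training gives a rough picture: memory in use with utilization near zero suggests the GPU is allocated but starved, while near-zero memory suggests the training is not reaching the GPU at all.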

The gpu_visible_device parameter limits training to a specific GPU, which by default is the first one. Since it appears you have only one GPU, that should not be an issue, but you can change it as seen in sdk/packages/object_pose_estimation/apps/pose_cnn_decoder/training/pose_estimation_cnn_training.py or sdk/packages/object_pose_estimation/apps/pose_cnn_decoder/training/training_config.json.
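For context, settings like this usually work by restricting which CUDA devices the process can see. A minimal sketch of that general mechanism, assuming (this is not checked against the Isaac source) that it comes down to the standard CUDA_VISIBLE_DEVICES environment variable; select_gpu is a hypothetical helper, not part of the SDK:

```python
import os


def select_gpu(index):
    """Expose only the GPU with the given index to CUDA libraries loaded afterwards."""
    if index < 0:
        raise ValueError("GPU index must be non-negative")
    # Has to be set before TensorFlow (or any other CUDA library) initializes the driver.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


# On a single-GPU machine the only valid choice is device 0.
select_gpu(0)
```

With a single GPU, any index other than 0 would hide the device from the training process, so the default should be left alone.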

Could you describe a bit more about the crash with Factory01? Does it have any console output that could be useful?

  1. Factory01 starts the Unity Simulator (waiting for Isaac Application to request a scene to load) for a few seconds, then exits (Segmentation fault (core dumped)).

  2. Can’t use GPU

I think the problem is that the Setup section of the Isaac manual is not clear. It says tensorflow would be installed via “./engine/build/scripts/install_dependencies.sh”

But I always end up with the wrong versions of packages (e.g. nvidia-driver, cuda; installing them via apt makes it worse), and I don’t even know which packages are actually included. I used a Python version checker, which told me that everything I need is already installed. As you can see, cuDNN error logs appear in the snapshot above.

I’m concerned that installing them manually will cause a version mismatch with the Isaac application.
Currently I am basically following the instructions in the Isaac 2020.2 guide.

Is it possible that your LD_LIBRARY_PATH has been overwritten rather than appended to? All of the “could not load dynamic library” lines refer to shared libraries that would be found in “/opt/ros/melodic/lib”, which makes sense for running ROS libraries only, but libcudnn.so.* would be found in /usr/lib/x86_64-linux-gnu.
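A quick way to check this is to list what LD_LIBRARY_PATH currently contains and look for cuDNN both there and in the usual system directory. A minimal sketch using only the Python standard library; find_library and ld_library_dirs are helper names made up for this example, and /usr/lib/x86_64-linux-gnu is assumed to be where the cuDNN package installed its files on your machine:

```python
import glob
import os


def ld_library_dirs(env=None):
    """Split LD_LIBRARY_PATH into its directories, in search order."""
    env = os.environ if env is None else env
    # Empty entries (from stray ':' separators) are dropped.
    return [d for d in env.get("LD_LIBRARY_PATH", "").split(":") if d]


def find_library(pattern, search_dirs):
    """Return sorted paths matching `pattern` (e.g. 'libcudnn.so*') in the given directories."""
    matches = []
    for directory in search_dirs:
        matches.extend(glob.glob(os.path.join(directory, pattern)))
    return sorted(matches)


dirs = ld_library_dirs()
print("LD_LIBRARY_PATH entries:", dirs)
print("libcudnn via LD_LIBRARY_PATH:", find_library("libcudnn.so*", dirs))
print("libcudnn in system dir:", find_library("libcudnn.so*", ["/usr/lib/x86_64-linux-gnu"]))
```

If the first line shows only the ROS directory, the variable was most likely overwritten in a shell startup file. If the last line is empty, cuDNN is probably not installed at all, which would point back at the dependency installation rather than the search path. Note that the dynamic loader also consults the ldconfig cache and its default directories, so an empty second line is not by itself proof of a problem.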