Error in ant.py

When I run “python train.py --help” or “python rlg_train.py --help” in isaacgym/python/rlgpu, I get the following error:

The original call is:
  File "/mnt/nas/jim/software/nvidia/isaacgym/python/rlgpu/tasks/ant.py", line 361

    prev_potentials_new = potentials.clone()
    potentials = -torch.norm(to_target, p=2, dim=-1) / dt
                  ~~~~~~~~~~ <--- HERE

There appears to be no version of norm() that can use an integer as the “p” argument. Does anyone have a fix for this?

That sounds weird. The torch documentation for norm() says that p can be an integer, a float, or a string for special norms. I use p=2 myself to compute the L2 norm.

Which version of torch do you have installed? The documentation states that torch.norm will be deprecated in the future, but that should not be a problem just yet.

Which version of torch do you have installed?

1.7.0. In particular, pytorch-1.7.0-py3.7_cuda11.0.221_cudnn8.0.3_0.
I followed the instructions on isaacgym/docs/install.html, and used Conda for installation. According to the instructions, “all of the packages will be installed with versions that are known to work.” I thought that this meant that all of the software in python/rlgpu would work.

I have the same version installed and I have not encountered the error while running python train.py --task=Ant.

Have you experienced this problem with any of the other example environments?

Have you experienced this problem with any of the other example environments?

I wasn’t specifying an environment; I was just specifying --help.

I reinstalled Miniconda and the rlgpu conda environment, but it didn’t help. I ended up replacing “torch.norm” with “torch.linalg.norm” and “p” with “ord” in the task files. “python train.py” works now. As you said, “weird.”
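For reference, here is a minimal sketch of that replacement. The tensor shapes and dt value are made up for illustration; the real values come from the Ant task in ant.py:

```python
import torch

# Hypothetical stand-ins for the values used in ant.py:
to_target = torch.randn(8, 3)  # batch of displacement vectors to the target
dt = 1.0 / 60.0                # simulation timestep

# Original call that raised the error:
# potentials = -torch.norm(to_target, p=2, dim=-1) / dt

# Workaround: torch.linalg.norm, with "p" renamed to "ord"
potentials = -torch.linalg.norm(to_target, ord=2, dim=-1) / dt
```

For vector norms taken along a single dimension the two calls are numerically identical, so the swap should not change training behavior.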


I suspect the error is related to your CUDA/cuDNN installation, since torch functions rely on these libraries when running on the GPU. Perhaps this is also connected to the poor performance you mentioned in the other (Cartpole) topic.

Just for reference, I have cuda==9.1.85 and cudnn==7.1.3, found using:
nvcc --version and cat /usr/lib/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2 (your file path might vary)
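If it helps narrow things down, the versions PyTorch itself was built against can also be checked from within Python (the printed values will of course vary by installation):

```python
import torch

print(torch.__version__)               # installed torch version, e.g. 1.7.0
print(torch.version.cuda)              # CUDA version torch was built against
print(torch.backends.cudnn.version())  # cuDNN version as an integer, or None
print(torch.cuda.is_available())       # whether the GPU is actually usable
```

Comparing these against the system-wide nvcc/cudnn.h versions can reveal a mismatch between the toolkit torch was compiled with and what is installed.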

Hi @jim.rothrock,

Do you have any updates on this issue? Did you have a chance to try a more recent version of PyTorch?

I can confirm - I tried training Ant in the past with all the major versions (1.6, 1.7, 1.8), and now I’m using 1.8.1; Ant trained well without any error messages.

Do you have any updates on this issue? Did you have a chance to try a more recent version of PyTorch?

I am staying with the versions that I installed like this:

./create_conda_env_rlgpu.sh
conda activate rlgpu

I don’t know why I had to use torch.linalg.norm(). I am running Ubuntu 18.04, if that makes any difference.

I found this thread interesting: Update internal `torch.norm` calls to `torch.linalg.norm` · Issue #49907 · pytorch/pytorch · GitHub

And while torch.norm() is deprecated, it is still supported even in 1.8.1, so your error looks quite mysterious.

I’m now using 1.0preview2, and I encountered the same problem with torch.norm() and worked around it the same way. Bizarre. It seems that I am the only person having this issue, and I think a complete reinstall of Ubuntu is the only thing that will fix it.