Mpi4py on Google Colab causing issues

Hello,

I am currently working on Google Colab with Modulus Sym. Two weeks ago (July 27), I was able to run the following code without any issues: about 4 seconds per 100 iterations, with the loss converging to the order of 1e-4 on a T4 GPU.

# Install Python 3.8 and register it as the default python3
!sudo apt-get update -y
!sudo apt-get install python3.8

!sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
!sudo update-alternatives --config python3
!sudo apt install python3-pip
!sudo apt-get install python3.8-distutils

!python --version

%pip --version
%pip install --upgrade pip

# Install Modulus Sym and its runtime dependencies
%pip install tensorboard pandas
%pip install nvidia-modulus.sym

from google.colab import drive
drive.mount('/content/drive')

%cd drive/MyDrive/modulus-sym

!python flow.py

However, this week, when trying to run the same code, I get the following error when running the pip install nvidia-modulus.sym command:

Building wheels for collected packages: mpi4py
  error: subprocess-exited-with-error
  
  × Building wheel for mpi4py (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  Building wheel for mpi4py (pyproject.toml) ... error
  ERROR: Failed building wheel for mpi4py
Failed to build mpi4py
ERROR: Could not build wheels for mpi4py, which is required to install pyproject.toml-based projects

I have found several workarounds that let the nvidia-modulus code run successfully; however, it runs much slower (around 18 s per 100 iterations) and only converges to the order of 1e-2, giving poor results.

The base flow.py code and conf_flow.yaml file may differ slightly from what I was running on July 27, but nothing that would explain the differences I am currently seeing. When running the code locally on Ubuntu with a Quadro M2000M GPU, the code also runs slowly (21 s per 100 iterations) and likewise fails to converge.

Once again, I do not know where this issue is coming from as I have not changed anything myself between July 27 and now.
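In case it helps with diagnosis, here is a small cell I can run in both the working and broken notebooks to snapshot the suspect package versions (the grep pattern is just my guess at the relevant packages):

# Record the resolved versions so the two environments can be diffed
!pip list | grep -iE "modulus|torch|mpi4py|tensorboard"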

Thank you,
Mathieu.

Hi @mathieusalz1

Thanks for the report. Can you try pinning Modulus to the previous version?

pip install nvidia-modulus==0.1.0
pip install nvidia-modulus.sym==1.0.0

To confirm: are you running on a single GPU? It seems the update may be causing the mpi4py dependency issue.
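If the wheel build itself keeps failing, note that mpi4py compiles from source when no prebuilt wheel matches the interpreter, and that requires an MPI toolchain on the machine. A minimal sketch for Colab, assuming the Ubuntu OpenMPI packages are sufficient:

# Install the OpenMPI runtime and headers so the mpi4py build can find mpicc
!sudo apt-get install -y libopenmpi-dev openmpi-bin
%pip install mpi4py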

Thank you for the response,

I tried adding the recommended code:

pip install nvidia-modulus==0.1.0
pip install nvidia-modulus.sym==1.0.0

While this did allow the installation to finish, when I run my Modulus code itself I get the following error:

/usr/local/lib/python3.8/dist-packages/hydra/_internal/callbacks.py:28: UserWarning: Callback ModulusCallback.on_job_start raised RuntimeError: Running CUDA fuser is only supported on CUDA builds.
  warnings.warn(
[17:08:04] - Arch Node: flow_network has been converted to a FuncArch node.
[17:13:39] - Arch Node: flow_network has been converted to a FuncArch node.
[17:13:39] - Installed PyTorch version 2.0.1+cu117 is not TorchScript supported in Modulus. Version 1.14.0a0+410ce96 is officially supported.
[17:13:39] - attempting to restore from: outputs/flow
[17:13:40] - Success loading optimizer: outputs/flow/optim_checkpoint.0.pth
[17:13:41] - Success loading model: outputs/flow/flow_network.0.pth
/usr/local/lib/python3.8/dist-packages/torch/_functorch/deprecated.py:58: UserWarning: We've integrated functorch into PyTorch. As the final step of the integration, functorch.vmap is deprecated as of PyTorch 2.0 and will be deleted in a future version of PyTorch >= 2.3. Please use torch.vmap instead; see the PyTorch 2.0 release notes and/or the torch.func migration guide for more details https://pytorch.org/docs/master/func.migrating.html
  warn_deprecated('vmap', 'torch.vmap')
/usr/local/lib/python3.8/dist-packages/torch/_functorch/deprecated.py:70: UserWarning: We've integrated functorch into PyTorch. As the final step of the integration, functorch.vjp is deprecated as of PyTorch 2.0 and will be deleted in a future version of PyTorch >= 2.3. Please use torch.func.vjp instead; see the PyTorch 2.0 release notes and/or the torch.func migration guide for more details https://pytorch.org/docs/master/func.migrating.html
  warn_deprecated('vjp')
Error executing job with overrides: []
Traceback (most recent call last):
  File "flow.py", line 201, in run
    flow_slv.solve()
  File "/usr/local/lib/python3.8/dist-packages/modulus/sym/solver/solver.py", line 173, in solve
    self._train_loop(sigterm_handler)
  File "/usr/local/lib/python3.8/dist-packages/modulus/sym/trainer.py", line 533, in _train_loop
    self.load_data(static=True)
  File "/usr/local/lib/python3.8/dist-packages/modulus/sym/solver/solver.py", line 75, in load_data
    self.domain.load_data(static)
  File "/usr/local/lib/python3.8/dist-packages/modulus/sym/domain/domain.py", line 136, in load_data
    constraint.load_data_static()
  File "/usr/local/lib/python3.8/dist-packages/modulus/sym/domain/constraint/continuous.py", line 108, in load_data_static
    self.load_data()
  File "/usr/local/lib/python3.8/dist-packages/modulus/sym/domain/constraint/continuous.py", line 95, in load_data
    invar, true_outvar, lambda_weighting = next(self.dataloader)
  File "/usr/local/lib/python3.8/dist-packages/modulus/sym/domain/constraint/constraint.py", line 252, in __iter__
    for batch in dataloader:
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 677, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 41, in fetch
    data = next(self.dataset_iter)
  File "/usr/local/lib/python3.8/dist-packages/modulus/sym/dataset/continuous.py", line 219, in __iter__
    yield from self.iterable_function()
  File "/usr/local/lib/python3.8/dist-packages/modulus/sym/dataset/continuous.py", line 183, in iterable_function
    importance = self.importance_measure(
  File "flow.py", line 132, in importance_measure
    outvar = importance_model_graph(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/modulus/sym/graph.py", line 234, in forward
    outvar.update(e(outvar))
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/modulus/sym/models/arch.py", line 656, in forward
    pred, jacobian = self._tensor_forward(x)
  File "/usr/local/lib/python3.8/dist-packages/modulus/sym/models/arch.py", line 786, in get_jacobian
    jacobian, pred = functorch.vmap(
  File "/usr/local/lib/python3.8/dist-packages/torch/_functorch/vmap.py", line 434, in wrapped
    return _flat_vmap(
  File "/usr/local/lib/python3.8/dist-packages/torch/_functorch/vmap.py", line 39, in fn
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_functorch/vmap.py", line 619, in _flat_vmap
    batched_outputs = func(*batched_inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_functorch/vmap.py", line 434, in wrapped
    return _flat_vmap(
  File "/usr/local/lib/python3.8/dist-packages/torch/_functorch/vmap.py", line 39, in fn
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/_functorch/vmap.py", line 619, in _flat_vmap
    batched_outputs = func(*batched_inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/modulus/sym/models/arch.py", line 782, in jacobian_func
    return vjpfunc(v)[0], pred
  File "/usr/local/lib/python3.8/dist-packages/torch/_functorch/eager_transforms.py", line 325, in wrapper
    result = _autograd_grad(flat_primals_out, flat_diff_primals, flat_cotangents,
  File "/usr/local/lib/python3.8/dist-packages/torch/_functorch/eager_transforms.py", line 113, in _autograd_grad
    grad_inputs = torch.autograd.grad(diff_outputs, inputs, grad_outputs,
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 303, in grad
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: accessing `data` under vmap transform is not allowed

The issue seems to be that the flow network is being converted to a FuncArch node, which then fails inside the functorch.vmap Jacobian computation (the RuntimeError about accessing `data` under vmap above).
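If the FuncArch conversion really is the trigger, it may be possible to bypass it through Hydra. A hedged sketch, assuming a graph.func_arch option exists in this Modulus Sym version (I have not confirmed the exact name for this release):

# Disable the FuncArch conversion so the network uses the standard autograd
# path instead of the functorch.vmap Jacobian that raises the RuntimeError
!python flow.py graph.func_arch=false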

I have since found a way to get the Modulus code to run correctly with the following setup:

!sudo apt-get update -y
!sudo apt-get install python3.8

!sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 1
!sudo update-alternatives --config python3
!sudo apt install python3-pip
!sudo apt-get install python3.8-distutils
from google.colab import drive
drive.mount('/content/drive')
%cd drive/MyDrive/modulus-sym
# Missing dependencies in modulus sym (fixed in next version)
%pip install tensorboard==2.13 pandas
#%pip install .
%pip install --no-deps .
%pip install pint==0.19.2
# Quote ">=" specifiers so the shell does not treat them as output redirection
%pip install "hydra-core>=1.2.0"
%pip install "termcolor>=2.1.1"
%pip install "chaospy>=4.3.7"
%pip install Cython==0.29.28
%pip install numpy-stl==2.16.3
%pip install opencv-python==4.5.5.64
%pip install scikit-learn==1.0.2
%pip install symengine==0.6.1
%pip install sympy==1.5.1
%pip install timm==0.5.4
%pip install torch-optimizer==0.3.0
%pip install transforms3d==0.3.1
%pip install typing==3.7.4.3
%pip install vtk==9.1.0
%pip install pillow==9.3.0
%pip install notebook==6.4.12
%pip install mistune==2.0.3
%pip install "tensorboard>=2.8.0"
%pip install h5py
!git lfs install
!python3.8 flow_parametric.py
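
After the install, a quick sanity check with the same interpreter used for the run, just to confirm that a CUDA build of PyTorch is active before training starts:

# Should print the torch version and True on a T4 runtime
!python3.8 -c "import torch; print(torch.__version__, torch.cuda.is_available())"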