Modulus Sym profiler doesn't work!

Issue regarding the profiler in Modulus Sym

I am trying to use the profiler. The profiler section of my config.yaml is:

profiler:
  profile: true
  start_step: 0
  end_step: 100
  name: "tensorboard"

The code runs fine in profiling mode, starting the profiler at step 0 and stopping it at step 100:

/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
[01:16:02] - JitManager: {'_enabled': False, '_arch_mode': <JitArchMode.ONLY_ACTIVATION: 1>, '_use_nvfuser': True, '_autograd_nodes': False}
[01:16:02] - GraphManager: {'_func_arch': False, '_debug': False, '_func_arch_allow_partial_hessian': True}
[01:16:03] - attempting to restore from: outputs/natural_convection
[01:16:03] - optimizer checkpoint not found
[01:16:03] - model flow_network.0.pth not found
[01:16:03] - model heat_network.0.pth not found
[01:16:03] - Running in profiling mode
[01:16:03] - Starting profiler at step 0
[01:16:04] - [step:          0] record constraint batch time:  1.322e-01s
[01:16:04] - [step:          0] record validators time:  1.455e-02s
[01:16:08] - [step:          0] record inferencers time:  4.565e+00s
[01:16:09] - [step:          0] saved checkpoint to outputs/natural_convection
[01:16:09] - [step:          0] loss:  4.766e-01
[01:16:12] - Attempting cuda graph building, this may take a bit...
[01:16:28] - Stopping profiler at step 100
[01:16:28] - [step:        100] loss:  3.573e-02, time/iteration:  1.981e+02 ms
[01:16:35] - [step:        200] loss:  1.823e-02, time/iteration:  6.546e+01 ms
[01:16:42] - [step:        300] loss:  1.791e-02, time/iteration:  6.579e+01 ms
[01:16:48] - [step:        400] loss:  1.524e-02, time/iteration:  6.583e+01 ms
[01:16:55] - [step:        500] record constraint batch time:  1.905e-01s
[01:16:55] - [step:        500] record validators time:  1.626e-02s
[01:16:59] - [step:        500] record inferencers time:  4.347e+00s

The event file is created, and when it is inspected from the CLI (tensorboard --inspect --logdir=./), the output is as follows:

======================================================================
Processing event files... (this can take a few minutes)
======================================================================

Found event files in:
./outputs/natural_convection

These tags are in ./outputs/natural_convection:
audio -
histograms -
images
   Inferencers/vtk_inf/p
   Inferencers/vtk_inf/theta
   Inferencers/vtk_inf/u
   Inferencers/vtk_inf/v
scalars -
tensor
   Train/learning_rate
   Train/loss_advection_diffusion_theta
   Train/loss_aggregated
   Train/loss_continuity
   Train/loss_momentum_x
   Train/loss_momentum_y
   Train/loss_normal_gradient_theta
   Train/loss_theta
   Train/loss_u
   Train/loss_v
   Validators/T_x/l2_relative_error_theta
   Validators/u_y/l2_relative_error_u
   Validators/v_x/l2_relative_error_v
   config/text_summary
======================================================================

Event statistics for ./outputs/natural_convection:
audio -
graph -
histograms -
images
   first_step           0
   last_step            1000
   max_step             1000
   min_step             0
   num_steps            3
   outoforder_steps     []
scalars -
sessionlog:checkpoint -
sessionlog:start
   outoforder_steps     []
   steps                [1001]
sessionlog:stop -
tensor
   first_step           0
   last_step            1000
   max_step             1000
   min_step             0
   num_steps            3
   outoforder_steps     []
======================================================================

As can be seen, there is no event data for profiling.
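
For reference, the same tags can be listed programmatically with TensorBoard's EventAccumulator; a small sketch using the logdir from the run above:

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

acc = EventAccumulator("./outputs/natural_convection")
acc.Reload()
print(acc.Tags())  # only the image/tensor tags listed above, nothing profiler-related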

I took a look at trainer.py. At line 473, the profiler settings are read in a try/except block:

        # create profiler
        try:
            self.profile = self.cfg.profiler.profile
            self.profiler_start_step = self.cfg.profiler.start_step
            self.profiler_end_step = self.cfg.profiler.end_step
            if self.profiler_end_step < self.profiler_start_step:
                self.profile = False
        except:
            self.profile = False
            self.profiler_start_step = -1
            self.profiler_end_step = -1
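
One side effect of the bare except is that any typo or missing key in the profiler section silently disables profiling instead of raising an error. A small sketch of that failure mode, assuming the config is an OmegaConf object composed by Hydra (Hydra enables struct mode, so a missing key raises):

from omegaconf import OmegaConf

# "end_step" deliberately omitted to mimic a typo in config.yaml
cfg = OmegaConf.create({"profiler": {"profile": True, "start_step": 0}})
OmegaConf.set_struct(cfg, True)  # Hydra composes configs in struct mode
try:
    _ = cfg.profiler.end_step  # missing key raises under struct mode
except Exception:
    print("profiling silently disabled")  # mirrors the trainer's bare except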

It seems to me that the NVTX profiler is applied in the train loop:

        # train loop
        with ExitStack() as stack:
            if self.profile:
                # Add NVTX context if in profile mode
                self.log.warning("Running in profiling mode")
                stack.enter_context(torch.autograd.profiler.emit_nvtx())

            for step in range(self.initial_step, self.max_steps + 1):

                if self.sigterm_handler():
                    if self.manager.rank == 0:
                        self.log.info(
                            f"Training terminated by the user at iteration {step}"
                        )
                    break

                if self.profile and step == self.profiler_start_step:
                    # Start profiling
                    self.log.info("Starting profiler at step {}".format(step))
                    profiler.start()

                if self.profile and step == self.profiler_end_step:
                    # Stop profiling
                    self.log.info("Stopping profiler at step {}".format(step))
                    profiler.stop()

                torch.cuda.nvtx.range_push("Training iteration")
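
For comparison, if the goal were a profile that TensorBoard can actually display, torch.profiler can write a trace with tensorboard_trace_handler. This is not what the trainer above does; it's just a minimal standalone sketch, with an output directory of my own choosing:

import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

with profile(
    activities=[ProfilerActivity.CPU],  # add ProfilerActivity.CUDA on a GPU machine
    schedule=schedule(wait=0, warmup=1, active=5),
    on_trace_ready=tensorboard_trace_handler("./outputs/profiler"),
) as prof:
    for _ in range(10):
        torch.randn(512, 512) @ torch.randn(512, 512)  # stand-in for a training step
        prof.step()  # advance the profiler schedule

The resulting trace is shown by the TensorBoard profiler plugin (torch-tb-profiler), which, as far as I can tell, is separate from the event file Modulus Sym writes.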

Unfortunately, I couldn't find any documentation for the profiler. If I read the code correctly, emit_nvtx() plus these profiler.start()/profiler.stop() calls (presumably torch.cuda.profiler, i.e. cudaProfilerStart/Stop) are meant for an external profiler such as Nsight Systems (e.g. nsys profile --capture-range=cudaProfilerApi), not for TensorBoard, which would explain why no profiling data ends up in the event file, but I couldn't confirm this anywhere.

On a separate note, I have another question: is there any way to define a custom stopping criterion for a run, other than the maximum number of iterations?
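
To make the question concrete, what I have in mind is something like this hypothetical check (not Modulus Sym API, just illustrating the intent):

# Hypothetical stopping criterion: stop once the aggregated loss has
# stayed below a tolerance for `patience` consecutive steps.
def should_stop(loss_history, tol=1e-4, patience=100):
    recent = loss_history[-patience:]
    return len(recent) == patience and max(recent) < tol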

Thanks in advance!