Issue with the profiler in Modulus Sym
I am trying to use the profiler. The profiler section of my config.yaml is:
profiler:
  profile: false
  start_step: 0
  end_step: 100
  name: "tensorboard"
The code runs well in profiling mode: the profiler starts at step 0 and stops at step 100.
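To enable profiling I set profile to true, either directly in config.yaml or with the usual Hydra command-line override (the script name below is just a placeholder):

python train_script.py profiler.profile=true profiler.start_step=0 profiler.end_step=100

The training log from this run: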
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
[01:16:02] - JitManager: {'_enabled': False, '_arch_mode': <JitArchMode.ONLY_ACTIVATION: 1>, '_use_nvfuser': True, '_autograd_nodes': False}
[01:16:02] - GraphManager: {'_func_arch': False, '_debug': False, '_func_arch_allow_partial_hessian': True}
[01:16:03] - attempting to restore from: outputs/natural_convection
[01:16:03] - optimizer checkpoint not found
[01:16:03] - model flow_network.0.pth not found
[01:16:03] - model heat_network.0.pth not found
[01:16:03] - Running in profiling mode
[01:16:03] - Starting profiler at step 0
[01:16:04] - [step: 0] record constraint batch time: 1.322e-01s
[01:16:04] - [step: 0] record validators time: 1.455e-02s
[01:16:08] - [step: 0] record inferencers time: 4.565e+00s
[01:16:09] - [step: 0] saved checkpoint to outputs/natural_convection
[01:16:09] - [step: 0] loss: 4.766e-01
[01:16:12] - Attempting cuda graph building, this may take a bit...
[01:16:28] - Stopping profiler at step 100
[01:16:28] - [step: 100] loss: 3.573e-02, time/iteration: 1.981e+02 ms
[01:16:35] - [step: 200] loss: 1.823e-02, time/iteration: 6.546e+01 ms
[01:16:42] - [step: 300] loss: 1.791e-02, time/iteration: 6.579e+01 ms
[01:16:48] - [step: 400] loss: 1.524e-02, time/iteration: 6.583e+01 ms
[01:16:55] - [step: 500] record constraint batch time: 1.905e-01s
[01:16:55] - [step: 500] record validators time: 1.626e-02s
[01:16:59] - [step: 500] record inferencers time: 4.347e+00s
The event file is created, and when it is inspected from the CLI (tensorboard --inspect --logdir=./), the results are as follows:
2023-10-26 09:18:01.373892: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-10-26 09:18:01.375898: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-10-26 09:18:01.410365: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-10-26 09:18:01.410389: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-10-26 09:18:01.410430: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-10-26 09:18:01.416843: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-10-26 09:18:01.417043: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-26 09:18:02.073496: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-10-26 09:18:02.526880: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2211] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
======================================================================
Processing event files... (this can take a few minutes)
======================================================================
Found event files in:
./outputs/natural_convection
These tags are in ./outputs/natural_convection:
audio -
histograms -
images
Inferencers/vtk_inf/p
Inferencers/vtk_inf/theta
Inferencers/vtk_inf/u
Inferencers/vtk_inf/v
scalars -
tensor
Train/learning_rate
Train/loss_advection_diffusion_theta
Train/loss_aggregated
Train/loss_continuity
Train/loss_momentum_x
Train/loss_momentum_y
Train/loss_normal_gradient_theta
Train/loss_theta
Train/loss_u
Train/loss_v
Validators/T_x/l2_relative_error_theta
Validators/u_y/l2_relative_error_u
Validators/v_x/l2_relative_error_v
config/text_summary
======================================================================
Event statistics for ./outputs/natural_convection:
audio -
graph -
histograms -
images
first_step 0
last_step 1000
max_step 1000
min_step 0
num_steps 3
outoforder_steps []
scalars -
sessionlog:checkpoint -
sessionlog:start
outoforder_steps []
steps [1001]
sessionlog:stop -
tensor
first_step 0
last_step 1000
max_step 1000
min_step 0
num_steps 3
outoforder_steps []
======================================================================
As can be seen, there is no event data for profiling.
I took a look at trainer.py. At line 473, the profiler settings are read in a try/except block:
# create profiler
try:
    self.profile = self.cfg.profiler.profile
    self.profiler_start_step = self.cfg.profiler.start_step
    self.profiler_end_step = self.cfg.profiler.end_step
    if self.profiler_end_step < self.profiler_start_step:
        self.profile = False
except:
    self.profile = False
    self.profiler_start_step = -1
    self.profiler_end_step = -1
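Since name is set to "tensorboard", I expected the profiling data to end up in the TensorBoard event files, roughly the way torch.profiler does it with a TensorBoard trace handler. The following is only my own minimal sketch of what I expected, not Modulus Sym's implementation; the output directory, step count, and training-step helper are made up:

import torch
from torch.profiler import (
    ProfilerActivity,
    profile,
    schedule,
    tensorboard_trace_handler,
)

max_steps = 100  # hypothetical

def run_training_step():
    pass  # placeholder for the actual training step

# profile CPU + CUDA activity for the first 100 steps and write a
# TensorBoard-readable trace into the output directory
prof = profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=0, warmup=0, active=100),
    on_trace_ready=tensorboard_trace_handler("./outputs/natural_convection"),
)

prof.start()
for step in range(max_steps):
    run_training_step()
    prof.step()  # advance the profiler schedule by one step
prof.stop()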
It seems to me the NVTX profiler is applied in the train loop:
# train loop
with ExitStack() as stack:
    if self.profile:
        # Add NVTX context if in profile mode
        self.log.warning("Running in profiling mode")
        stack.enter_context(torch.autograd.profiler.emit_nvtx())

    for step in range(self.initial_step, self.max_steps + 1):
        if self.sigterm_handler():
            if self.manager.rank == 0:
                self.log.info(
                    f"Training terminated by the user at iteration {step}"
                )
            break

        if self.profile and step == self.profiler_start_step:
            # Start profiling
            self.log.info("Starting profiler at step {}".format(step))
            profiler.start()

        if self.profile and step == self.profiler_end_step:
            # Stop profiling
            self.log.info("Stopping profiler at step {}".format(step))
            profiler.stop()

        torch.cuda.nvtx.range_push("Training iteration")
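If I read this correctly, profiler.start()/profiler.stop() and the NVTX ranges are meant to be consumed by an external tool such as Nsight Systems rather than written to the TensorBoard event file. Assuming profiler here is torch.cuda.profiler (i.e. cudaProfilerStart/Stop), I would expect a capture along these lines, where the script name is again a placeholder:

nsys profile --capture-range=cudaProfilerApi --trace=cuda,nvtx -o natural_convection_profile python train_script.py profiler.profile=true

Is that the intended workflow, or should the "tensorboard" profiler name also write profiling data into the event file?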
Unfortunately, I couldn't find any documentation for the profiler.
On a separate note, is there any way to define a custom stopping criterion for the run other than the maximum number of iterations?
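For example, stopping once the aggregated loss drops below a threshold is what I have in mind. A purely illustrative sketch (the threshold, loss values, and training-step helper are made up):

initial_step = 0
max_steps = 10000
loss_threshold = 1e-3  # hypothetical stopping threshold

def run_training_step():
    return 1.0  # placeholder for the actual step, returning the aggregated loss

for step in range(initial_step, max_steps + 1):
    loss = run_training_step()
    if loss < loss_threshold:
        print(f"Stopping at step {step}: loss {loss:.3e} is below the threshold")
        break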
Thanks in advance.