I am running a training job and the only values I can see via TensorBoard are global_step, regularization_cost, task_cost, and total_cost. These aren’t especially interesting to me; I’d rather see loss and accuracy statistics.
Is there a way to configure TLT to produce these values? Or is there perhaps a way to launch TensorBoard so that it shows me scalars that are already present but not displayed by default?
Thanks in advance for any comments or suggestions.
Hi monocongo,
Sorry, the TLT code is not open source.
For your case, could you please dump the loss info from the training log?
Thanks very much for using TLT!
I apologize, but I don’t understand what you mean by dumping the loss info from the training log. Which file is the training log? Is it output/monitor.json? Are you suggesting that there’s a way to get that info into a format that TensorBoard can use for visualizations?
To clarify, I’m interested in visualizing the loss and accuracy so I can perhaps determine the optimal time to stop training in order to prevent overfitting. It may be that I will need to gin up my own visualizations (with matplotlib etc.) and not depend on TensorBoard.
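If I do end up extracting the numbers myself, then presumably I could feed them back into TensorBoard by writing my own event file. Below is a rough sketch of what I have in mind, assuming the TensorFlow 1.x summary API (which I believe TLT is built on); the log directory, tag names, and values are placeholders for whatever gets parsed out of the training output:

import tensorflow as tf


def write_scalars_to_tensorboard(log_dir, records):
    # create an event file writer for the given log directory
    writer = tf.summary.FileWriter(log_dir)
    for epoch, tag, value in records:
        # wrap each scalar in a Summary protobuf, logged against the epoch number
        summary = tf.Summary(value=[tf.Summary.Value(tag=tag, simple_value=value)])
        writer.add_summary(summary, global_step=epoch)
    writer.flush()
    writer.close()


# placeholder values, e.g. parsed from the console output shown further below
write_scalars_to_tensorboard(
    "/tmp/tlt_tensorboard",
    [(211, "validation_cost", 0.000489), (211, "mAP", 39.9846)],
)

Pointing TensorBoard at that directory (tensorboard --logdir /tmp/tlt_tensorboard) should then display the scalars.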
Also, where are the accuracy measurements recorded? The only thing I have been able to find so far is the output that is written to the console as the model is training, for example:
Epoch 211/400
=========================
Validation cost: 0.000489
Mean average_precision (in %): 39.9846
class name average precision (in %)
------------ --------------------------
handgun 42.4442
rifle 37.5249
Median Inference Time: 0.026255
Can anyone suggest how I might capture this and/or other accuracy measurements reported by the training process?
When training a model it’s very informative to see how the model is performing. With TLT I usually have to parse the console output with regexes and then plot the results myself. It would be really helpful to have some sort of utility that can produce graphs in real time, such as TensorBoard, so that we can watch the performance of our models as they train.
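For example, here is a rough sketch of the kind of parsing I mean, assuming the per-epoch console output shown above has been captured to a log file (the file paths and column names are hypothetical):

import csv
import re

# patterns matching the per-epoch lines in the captured console output
EPOCH_RE = re.compile(r"Epoch (\d+)/\d+")
COST_RE = re.compile(r"Validation cost: ([\d.]+)")
MAP_RE = re.compile(r"Mean average_precision \(in %\): ([\d.]+)")


def parse_training_log(log_path, csv_path):
    # scrape (epoch, loss, mAP) tuples from the captured log
    rows = []
    epoch = None
    cost = None
    with open(log_path) as log_file:
        for line in log_file:
            match = EPOCH_RE.search(line)
            if match:
                epoch = int(match.group(1))
                continue
            match = COST_RE.search(line)
            if match:
                cost = float(match.group(1))
                continue
            match = MAP_RE.search(line)
            if match:
                rows.append((epoch, cost, float(match.group(1))))

    # write the scraped values to a CSV, one row per epoch, for plotting
    with open(csv_path, "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["epoch", "loss", "mAP"])
        writer.writerows(rows)


parse_training_log("training_console.log", "training_stats.csv")

The resulting CSV, with the epoch as the first column, can then be fed to a plotting script like the one posted below.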
Agreed. It shouldn’t be that heavy of a lift to add these stats to the events being logged already, so we could use TensorBoard to get more useful information as these models are trained. Unfortunately, this code is not open source so I can’t do it myself, and it looks like the folks responsible for it have bigger fish to fry. Overall TLT seems very half-baked and could be improved dramatically by allowing users to contribute to the code, but apparently that’s not NVIDIA’s strategy…
In case it’s helpful, here is some Python code that plots the output CSV from an SSD model training run:
import argparse

import pandas as pd
import matplotlib.pyplot as plt


# ------------------------------------------------------------------------------
def plot_training_stats(
        training_stats_csv: str,
        title: str,
):
    # read the CSV, which will have columns (epoch, AP_handgun, AP_rifle, loss, mAP),
    # using the epoch column as the index
    df = pd.read_csv(training_stats_csv, index_col=0)

    # get rid of rows with NaN values
    df = df.dropna()

    # plot the value (non-index) columns
    df.plot(title=title)

    # display the plot
    plt.show()


# ------------------------------------------------------------------------------
if __name__ == "__main__":

    # USAGE
    # $ python plot_training_stats.py --csv ~/tmp/ssd_training_stats.csv \
    #       --title "SSD Weapons Detector, training run 2019/11/08"

    # construct the argument parser and parse the arguments
    args_parser = argparse.ArgumentParser()
    args_parser.add_argument(
        "--csv",
        type=str,
        required=True,
        help="Path to the CSV containing the TLT training stats to be visualized",
    )
    args_parser.add_argument(
        "--title",
        type=str,
        required=True,
        help="Plot title",
    )
    args = vars(args_parser.parse_args())

    plot_training_stats(args["csv"], args["title"])