How to get additional TensorBoard output?

I am running a training and the only values I can see via TensorBoard are global_step, regularization_cost, task_cost, and total_cost. These aren’t especially interesting to me and I’d instead like to see some loss and accuracy statistics.

Is there a way to configure the TLT to produce these values? Or maybe there’s a way to launch TensorBoard so that it shows me scalars that are already present but not currently being displayed by default?

Thanks in advance for any comments or suggestions.


I might be able to add these myself if I can access/update the model code. Is this available?

Hi monocongo,
Sorry, the TLT code is not open source.
For your case, could you please dump the loss info from the training log?
Thanks very much for using TLT!

Thanks, Morgan.

I apologize, I don’t get your meaning about dumping the loss info from the training log. Which file is the training log? Is it output/monitor.json? Are you suggesting that there’s a way to get that info into a format that can be used for visualizations by TensorBoard?

To clarify, I’m interested in visualizing the loss and accuracy so I can perhaps determine the optimal time to stop training in order to prevent overfitting. It may be that I will need to gin up my own visualizations (with matplotlib etc.) and not depend on TensorBoard.

Also, where are the accuracy measurements recorded? The only thing I have been able to find so far is the output that is written to the console as the model is training, for example:

Epoch 211/400
=========================

Validation cost: 0.000489
Mean average_precision (in %): 39.9846

class name      average precision (in %)
------------  --------------------------
handgun                          42.4442
rifle                            37.5249

Median Inference Time: 0.026255

Can anyone suggest how I might capture this and/or other accuracy measurements reported by the training process?
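One way to capture those numbers is to scrape the console output with a small parser. Below is a minimal sketch: the regex patterns are assumptions based on the exact lines quoted above from one SSD training run, so they may need adjusting for other TLT versions or model types. The `write_csv` helper produces a file in the same shape (epoch index plus loss/mAP/per-class AP columns) that the plotting script later in this thread expects.

```python
import csv
import re
from typing import Dict, List

# Patterns matching the console lines quoted above; these are assumptions
# based on one SSD training run and may differ for other TLT versions.
EPOCH_RE = re.compile(r"^Epoch (\d+)/\d+")
COST_RE = re.compile(r"^Validation cost: ([\d.]+)")
MAP_RE = re.compile(r"^Mean average_precision \(in %\): ([\d.]+)")
CLASS_AP_RE = re.compile(r"^(\w+) +([\d.]+)$")


def parse_training_log(text: str) -> List[Dict]:
    """Collect one record per epoch: validation loss, mAP, per-class AP."""
    records: List[Dict] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = EPOCH_RE.match(line)
        if match:
            # start a new record each time an "Epoch N/M" header appears
            current = {"epoch": int(match.group(1))}
            records.append(current)
            continue
        if current is None:
            continue
        match = COST_RE.match(line)
        if match:
            current["loss"] = float(match.group(1))
            continue
        match = MAP_RE.match(line)
        if match:
            current["mAP"] = float(match.group(1))
            continue
        match = CLASS_AP_RE.match(line)
        if match:
            # per-class rows such as "handgun    42.4442"
            current["AP_" + match.group(1)] = float(match.group(2))
    return records


def write_csv(records: List[Dict], path: str) -> None:
    """Write the records to a CSV with the epoch column first."""
    columns = sorted({key for rec in records for key in rec} - {"epoch"})
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["epoch"] + columns)
        writer.writeheader()
        writer.writerows(records)
```

Usage would be something like capturing the training console output to a file (e.g. via `tee`), then `write_csv(parse_training_log(open("training.log").read()), "stats.csv")`.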

Hi monocongo,
Yes, I mean that you can find the info in the console output as the model is training.

When training a model it's very informative to see how it is performing. With TLT I usually have to parse the output with regexes to get graphs of the model's progress. It would be really helpful to have some sort of utility that produces graphs in real time, such as TensorBoard, so that we can view the performance of our models.

Agreed. It shouldn’t be that heavy of a lift to add stats to the events being logged already, so we could use TensorBoard to get more useful information as these models are trained. Unfortunately this code is not open source so I can’t do it myself, and it looks like the folks responsible for it have bigger fish to fry. Overall TLT seems very half-baked and could be improved dramatically by allowing users to contribute to the code, but apparently that’s not NVIDIA’s strategy…

After syncing with the internal team: the next release will add more useful information to help users analyze the training.

Thanks, Morganh!

If you’re taking suggestions I’m interested in basic loss and accuracy scalars/graphs.

In case it’s helpful here is some Python code that plots the output CSV from SSD model training:

import argparse

import pandas as pd
import matplotlib.pyplot as plt


# ------------------------------------------------------------------------------
def plot_training_stats(
        training_stats_csv: str,
        title: str,
):
    # read the CSV which will have columns (epoch, AP_handgun, AP_rifle, loss, mAP)
    # using the epoch column as the index
    df = pd.read_csv(training_stats_csv, index_col=0)

    # get rid of rows with NaN values
    df = df.dropna()

    # plot the value (non-index) columns
    df.plot(title=title)

    # display the plot
    plt.show()


# ------------------------------------------------------------------------------
if __name__ == "__main__":
    # USAGE
    # $ python plot_training_stats.py --csv ~/tmp/ssd_training_stats.csv  \
    #     --title "SSD Weapons Detector, training run 2019/11/08"

    # construct the argument parser and parse the arguments
    args_parser = argparse.ArgumentParser()
    args_parser.add_argument(
        "--csv",
        type=str,
        required=True,
        help="Path to the CSV containing the TLT training stats to be visualized",
    )
    args_parser.add_argument(
        "--title",
        type=str,
        required=True,
        help="Plot title",
    )
    args = vars(args_parser.parse_args())

    plot_training_stats(args["csv"], args["title"])