What is the best practice for data placement across SLC, TLC, and QLC SSDs during data preparation, training, and inference?

Description

This is not a bug report; it is a discussion.
I would like to understand, for the data preparation and model training phases of AI training, how many temporary files are created, how large they are, and how long they live. Are they very short-lived files?

Thank you so much, JanuszL! I would like to understand the AI training process: what kinds of temporary files are created and how long they live. We have different SSD types (SLC, TLC, and QLC), and we could place data on them correctly based on file lifetime. For example:

  • A source dataset that has just been ingested into the storage system can be written directly to a QLC device, since the source dataset is static data.
  • Distributed file system WAL/metadata can be written to SLC, because its stable latency and small capacity are a good fit.
  • AI training checkpoints are temporary files too, since I think we only need to keep the last N checkpoints. Because they need higher storage bandwidth, we can write them to Gen5 TLC.
This is my rough idea, but I do not know what kinds of temporary files are created during the data preparation, training, and inference steps. Could you please shed some light on this?
For example, does DALI split videos or resize images from the source dataset into a format that training needs?
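To make the rough idea concrete, here is a minimal Python sketch of the placement policy described above, together with the "keep only the last N checkpoints" rule. Everything in it is an assumption for illustration: the data class names, the `*.ckpt` file pattern, the choice of TLC as the default tier, and the function names are hypothetical and do not come from DALI or any storage product.

```python
from enum import Enum
from pathlib import Path
from typing import List


class Tier(Enum):
    SLC = "slc"  # small capacity, stable latency
    TLC = "tlc"  # Gen5 TLC, high write bandwidth
    QLC = "qlc"  # large capacity, suited to static data


# Hypothetical data classes mapped to tiers, following the idea above.
PLACEMENT = {
    "source_dataset": Tier.QLC,  # ingested once, then static
    "wal_metadata": Tier.SLC,    # distributed file system WAL/metadata
    "checkpoint": Tier.TLC,      # only the last N are kept
}


def tier_for(data_class: str) -> Tier:
    """Return the tier for a data class; unknown classes fall back to TLC (an assumption)."""
    return PLACEMENT.get(data_class, Tier.TLC)


def prune_checkpoints(ckpt_dir: Path, keep_last: int) -> List[Path]:
    """Delete all but the newest `keep_last` checkpoint files and return the deleted paths."""
    if keep_last < 0:
        raise ValueError("keep_last must be non-negative")
    # Oldest first; the file name breaks ties between equal modification times.
    ckpts = sorted(ckpt_dir.glob("*.ckpt"), key=lambda p: (p.stat().st_mtime, p.name))
    stale = ckpts[:-keep_last] if keep_last > 0 else ckpts
    for path in stale:
        path.unlink()
    return stale
```

With a policy like this, a training job would call `prune_checkpoints` after each save, so the TLC tier only ever holds N checkpoints. What I still do not know is which other file classes belong in the `PLACEMENT` table.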
If you would prefer to talk offline, my email address is wayne.gao1@solidigm.com.

Hi,
Please share the ONNX model and the script, if you have not already, so that we can assist you better.
In the meantime, you can try a few things:

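  1. Validate your model with the snippet below.

check_model.py

import sys
import onnx

# Path to the ONNX model, passed as the first command-line argument.
filename = sys.argv[1]
model = onnx.load(filename)
# Raises an exception if the model is not valid.
onnx.checker.check_model(model)

  2. Try running your model with the trtexec command.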

If you are still facing issues, please share the trtexec "--verbose" log for further debugging.
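One possible way to capture that verbose log is a small Python wrapper around trtexec, sketched below. `--onnx` and `--verbose` are trtexec options; the function names and the default log file name are made up for illustration, and the sketch assumes trtexec is on your PATH.

```python
import shutil
import subprocess
from typing import List


def build_trtexec_cmd(onnx_path: str) -> List[str]:
    """Build the trtexec command line for an ONNX model with verbose logging."""
    return ["trtexec", f"--onnx={onnx_path}", "--verbose"]


def run_trtexec(onnx_path: str, log_path: str = "trtexec_verbose.log") -> int:
    """Run trtexec, write stdout and stderr to log_path, and return the exit code."""
    if shutil.which("trtexec") is None:
        raise FileNotFoundError("trtexec was not found on PATH")
    with open(log_path, "w") as log:
        result = subprocess.run(
            build_trtexec_cmd(onnx_path),
            stdout=log,
            stderr=subprocess.STDOUT,
        )
    return result.returncode
```

Calling `run_trtexec("model.onnx")` would leave the full log in `trtexec_verbose.log`, which can then be attached to the thread.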
Thanks!