Description
This is not a bug report; it is a discussion topic.
I would like to understand which temporary files are created during the data preparation and model training phases of AI training:
how many there are, how large they are, and how long they live. Are they very short-lived files?
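To make the question concrete, here is a minimal sketch of how I imagine measuring this myself: poll a scratch directory and record when files appear and disappear. The names `snapshot`, `LifetimeTracker`, and `watch` are hypothetical helpers of mine, not part of DALI or any framework, and files that live for less than one polling interval would be missed, so the result is only a rough estimate.

```python
import os
import time


def snapshot(root):
    """Return {path: size_in_bytes} for every regular file under root."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes[path] = os.path.getsize(path)
            except OSError:
                # The file vanished between listing and stat: very short-lived.
                continue
    return sizes


class LifetimeTracker:
    """Track when files are first seen and when they disappear."""

    def __init__(self):
        self.first_seen = {}  # path -> (first-seen timestamp, last known size)
        self.finished = []    # (path, last known size, observed lifetime in s)

    def update(self, sizes, now):
        # Record new files and refresh the size of files already known.
        for path, size in sizes.items():
            born = self.first_seen.get(path, (now, size))[0]
            self.first_seen[path] = (born, size)
        # Any known file missing from this snapshot has been deleted.
        for path in list(self.first_seen):
            if path not in sizes:
                born, size = self.first_seen.pop(path)
                self.finished.append((path, size, now - born))


def watch(root, interval_s=1.0, rounds=60):
    """Poll root a fixed number of times and return the deleted-file records."""
    tracker = LifetimeTracker()
    for _ in range(rounds):
        tracker.update(snapshot(root), time.time())
        time.sleep(interval_s)
    return tracker.finished
```

Running `watch()` on the scratch or cache directory during a training job would give a list of (path, size, lifetime) tuples, which is the kind of data I am hoping someone already has.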
Thank you so much, JanuszL! I would like to understand the AI training process: what kinds of temporary files are created and how long they live. For example, we have different SSD types (SLC, TLC, QLC), and we could place data on them correctly based on file lifetime. A source dataset that has just been ingested into the storage system can be written directly to a QLC device, since it is static data. The WAL and metadata of a distributed file system can be written to SLC, because they need stable latency and fit in a small capacity. For AI training checkpoints, I think we only need to keep the last N, so they are temporary files too; because they need higher storage bandwidth, we could write them to Gen5 TLC.
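As a rough illustration of the idea above, here is a minimal sketch of a placement table and a "keep the last N checkpoints" rule. Everything in it is an assumption of mine for discussion: the data class names, the tier labels, the `ckpt_<step>.pt` file naming, and the helper names `tier_for` and `prune_checkpoints` do not come from any real framework.

```python
import os
import re

# Hypothetical mapping from data class to device tier, following the idea above.
PLACEMENT = {
    "source_dataset": "QLC",  # static once ingested
    "wal_metadata": "SLC",    # small and latency-sensitive
    "checkpoint": "TLC",      # large writes, only the last N are kept
}

# Assumed checkpoint naming scheme: ckpt_<step>.pt
CKPT_RE = re.compile(r"^ckpt_(\d+)\.pt$")


def tier_for(data_class):
    """Return the device tier assumed for a given data class."""
    return PLACEMENT[data_class]


def prune_checkpoints(directory, keep_last):
    """Delete all but the newest keep_last checkpoints; return removed names."""
    found = []
    for name in os.listdir(directory):
        match = CKPT_RE.match(name)
        if match:
            found.append((int(match.group(1)), name))
    found.sort()  # oldest step first, compared numerically
    stale = found[:-keep_last] if keep_last > 0 else found
    removed = []
    for _step, name in stale:
        os.remove(os.path.join(directory, name))
        removed.append(name)
    return removed
```

With a rule like this, a checkpoint lives for roughly N checkpoint intervals before it is deleted, which is the kind of lifetime information I would like to have for the other file types as well.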
This is my rough idea, but I do not know what kinds of temporary files are created during the data preparation, training, and inference steps. Could you please shed some light on this?
For example, does DALI split videos or resize images from the source dataset into a format that training needs?
If you would prefer to talk offline, my email address is wayne.gao1@solidigm.com
Environment
TensorRT Version:
GPU Type:
Nvidia Driver Version:
CUDA Version:
CUDNN Version:
Operating System + Version:
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):