TAO UNET Running Out of Disk Space?

Please provide the following information when requesting support.

• Hardware RTX3090
• Network Type unet
• TLT Version

Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 2.0
toolkit_version: 3.22.02
published_date: 02/28/2022

• Training spec file
unet_train_resnet_6S100.txt (1.3 KB)

I run the training step as follows:

!tao unet train --gpus=1 --gpu_index=$GPU_INDEX \
              -e $SPECS_DIR/unet_train_resnet_6S100.txt \
              -r $USER_EXPERIMENT_DIR/unpruned \
              -m $USER_EXPERIMENT_DIR/pretrained_resnet18/resnet_18.hdf5  \
              -n model \
              -k $KEY 

It stops with this error:

OSError: [Errno 28] No space left on device: '/tmp/tmp0ld14dko'

The smallest amount of free space on any of my drives is 35 GB.

Training log attached here…
training log 2022 11 01.txt (203.1 KB)

THANKS!!

Please check the results folder and delete some old .tlt models.
BTW, you can change
checkpoint_interval: 1
to
checkpoint_interval: 10

to generate fewer .tlt models.
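As a rough illustration of the cleanup suggested above, here is a minimal Python sketch that picks out old checkpoints to delete while keeping the most recent few. It assumes checkpoint files are named model.step-<N>.tlt, as they are in this thread's results folder; the function name and the keep_last parameter are made up for this example, and it only lists candidates rather than deleting anything.

```python
import re
from pathlib import Path

# Matches checkpoint names such as model.step-8400.tlt
STEP_PATTERN = re.compile(r"^model\.step-(\d+)\.tlt$")


def checkpoints_to_delete(results_dir, keep_last=3):
    """Return checkpoint paths, lowest step first, excluding the keep_last highest steps."""
    found = []
    for path in Path(results_dir).iterdir():
        match = STEP_PATTERN.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    found.sort()  # numeric sort, so step-1008 comes after step-924
    if keep_last <= 0:
        return [path for _, path in found]
    return [path for _, path in found[:-keep_last]]


# Example use on the host (the path is hypothetical):
#   for path in checkpoints_to_delete("/mnt/DATA/MP/6S004C/unpruned"):
#       print("would delete", path)   # swap print for path.unlink() once checked
```

Printing the list first and only then switching to path.unlink() makes it harder to remove the checkpoint you meant to keep.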

That’s not the problem because the drive where the tao folder NOTEBOOK_ROOT is located has about 1TB unused, and the OS drive has 35GB unused…

Is the /tmp directory, referred to in the error, inside the docker image? If so, can I control on which drive the docker image gets saved?

My local /tmp folder holds a total of 300 kB, so the error No space left on device: '/tmp/tmp0ld14dko' is mysterious…
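One way to answer the question above is to ask, from inside the container, how much space is free on the filesystem that actually backs the temp directory. The sketch below uses only the Python standard library. The name /tmp/tmp0ld14dko looks like one generated by Python's tempfile module, but that is an assumption, and the helper names here are made up for the example.

```python
import os
import shutil
import tempfile


def free_gib(path):
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def report(paths):
    """Return one human-readable line per path; missing paths are flagged, not fatal."""
    lines = []
    for path in paths:
        if os.path.exists(path):
            lines.append(f"{path}: {free_gib(path):.1f} GiB free")
        else:
            lines.append(f"{path}: not found")
    return lines


# Example use inside the container:
#   for line in report([tempfile.gettempdir(), "/workspace/tao-experiments"]):
#       print(line)
```

If the two paths report very different numbers, the temp directory is on a different filesystem from the mounted experiment folder.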

Please check your ~/.tao_mounts.json file. It maps local folders into the docker container.
If you mount a local disk that does not have enough space, this error can occur.

See more in TAO Toolkit Launcher — TAO Toolkit 3.22.05 documentation.

All mappings in the ~/.tao_mounts.json point to a drive with more than 1 TB free. The folder '/tmp/tmp0ld14dko' appears to be inside the docker container.

Can you run the command below?
! tao unet run ls -rltsh $USER_EXPERIMENT_DIR/unpruned

Also, can you share the ~/.tao_mounts.json file?

You can also log in to the docker container to check why there is a no-space error.
Open a terminal, then run
$ tao unet run /bin/bash

2022-11-03 22:51:04,897 [INFO] root: Registry: ['nvcr.io']
2022-11-03 22:51:05,067 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
total 39G
3.9M -rwxrwxrwx 1 root root 3.9M Nov 1 10:40 events.out.tfevents.1667298969.7c614c1ba579
180M -rwxrwxrwx 1 root root 180M Nov 1 10:41 model.step-0.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 10:42 model.step-84.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 10:43 model.step-168.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 10:44 model.step-252.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 10:45 model.step-336.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 10:46 model.step-420.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 10:47 model.step-504.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 10:48 model.step-588.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 10:48 model.step-672.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 10:49 model.step-756.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 10:50 model.step-840.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 10:51 model.step-924.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 10:52 model.step-1008.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 10:53 model.step-1092.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 10:54 model.step-1176.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 10:54 model.step-1260.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 10:55 model.step-1344.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 10:56 model.step-1428.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 10:57 model.step-1512.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 10:58 model.step-1596.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 10:59 model.step-1680.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:00 model.step-1764.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:00 model.step-1848.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:01 model.step-1932.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:02 model.step-2016.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:03 model.step-2100.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:04 model.step-2184.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:05 model.step-2268.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:06 model.step-2352.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:06 model.step-2436.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:07 model.step-2520.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:08 model.step-2604.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:09 model.step-2688.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:10 model.step-2772.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:11 model.step-2856.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:12 model.step-2940.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:12 model.step-3024.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:13 model.step-3108.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:14 model.step-3192.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:15 model.step-3276.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:16 model.step-3360.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:17 model.step-3444.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:18 model.step-3528.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:18 model.step-3612.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:19 model.step-3696.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:20 model.step-3780.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:21 model.step-3864.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:22 model.step-3948.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:23 model.step-4032.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:24 model.step-4116.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:24 model.step-4200.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:25 model.step-4284.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:26 model.step-4368.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:27 model.step-4452.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 11:28 model.step-4536.tlt
4.1M -rwxrwxrwx 1 root root 4.1M Nov 1 11:29 events.out.tfevents.1667299254.a21c73be0d4c
180M -rwxrwxrwx 1 root root 180M Nov 1 12:03 model.step-4620.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:05 model.step-4704.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:06 model.step-4788.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:07 model.step-4872.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:08 model.step-4956.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:09 model.step-5040.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:09 model.step-5124.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:10 model.step-5208.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:11 model.step-5292.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:12 model.step-5376.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:13 model.step-5460.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:14 model.step-5544.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:14 model.step-5628.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:15 model.step-5712.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:16 model.step-5796.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:17 model.step-5880.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:18 model.step-5964.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:19 model.step-6048.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:20 model.step-6132.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:20 model.step-6216.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:21 model.step-6300.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:22 model.step-6384.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:23 model.step-6468.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:24 model.step-6552.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:25 model.step-6636.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:25 model.step-6720.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:26 model.step-6804.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:27 model.step-6888.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:28 model.step-6972.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:29 model.step-7056.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:30 model.step-7140.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:30 model.step-7224.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:31 model.step-7308.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:32 model.step-7392.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:33 model.step-7476.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:34 model.step-7560.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:35 model.step-7644.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:36 model.step-7728.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:36 model.step-7812.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:37 model.step-7896.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:38 model.step-7980.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:39 model.step-8064.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:40 model.step-8148.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:41 model.step-8232.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 12:41 model.step-8316.tlt
4.1M -rwxrwxrwx 1 root root 4.1M Nov 1 12:42 events.out.tfevents.1667304214.8f745beb683c
0 drwxrwxrwx 1 root root 0 Nov 1 12:42 weights
512 -rwxrwxrwx 1 root root 228 Nov 1 12:43 results_100.json
180M -rwxrwxrwx 1 root root 180M Nov 1 13:09 model.step-8400.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:11 model.step-8484.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:11 model.step-8568.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:12 model.step-8652.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:13 model.step-8736.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:14 model.step-8820.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:15 model.step-8904.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:16 model.step-8988.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:17 model.step-9072.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:17 model.step-9156.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:18 model.step-9240.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:19 model.step-9324.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:20 model.step-9408.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:21 model.step-9492.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:22 model.step-9576.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:23 model.step-9660.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:23 model.step-9744.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:24 model.step-9828.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:25 model.step-9912.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:26 model.step-9996.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:27 model.step-10080.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:28 model.step-10164.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:29 model.step-10248.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:29 model.step-10332.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:30 model.step-10416.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:31 model.step-10500.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:32 model.step-10584.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:33 model.step-10668.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:34 model.step-10752.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:34 model.step-10836.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:35 model.step-10920.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:36 model.step-11004.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:37 model.step-11088.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:38 model.step-11172.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:39 model.step-11256.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:40 model.step-11340.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:40 model.step-11424.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:41 model.step-11508.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:42 model.step-11592.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:43 model.step-11676.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:44 model.step-11760.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:45 model.step-11844.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:46 model.step-11928.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:46 model.step-12012.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:47 model.step-12096.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:48 model.step-12180.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:49 model.step-12264.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:50 model.step-12348.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:51 model.step-12432.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 13:52 model.step-12516.tlt
4.1M -rwxrwxrwx 1 root root 4.1M Nov 1 13:52 events.out.tfevents.1667308142.f2e56ee78e24
512 -rwxrwxrwx 1 root root 225 Nov 1 14:01 results_150.json
180M -rwxrwxrwx 1 root root 180M Nov 1 14:18 model.step-12600.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:20 model.step-12684.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:21 model.step-12768.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:22 model.step-12852.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:23 model.step-12936.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:24 model.step-13020.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:25 model.step-13104.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:25 model.step-13188.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:26 model.step-13272.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:27 model.step-13356.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:28 model.step-13440.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:29 model.step-13524.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:30 model.step-13608.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:31 model.step-13692.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:31 model.step-13776.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:32 model.step-13860.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:33 model.step-13944.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:34 model.step-14028.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:35 model.step-14112.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:36 model.step-14196.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:36 model.step-14280.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:37 model.step-14364.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:38 model.step-14448.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:39 model.step-14532.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:40 model.step-14616.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:41 model.step-14700.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:42 model.step-14784.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:43 model.step-14868.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:43 model.step-14952.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:44 model.step-15036.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:45 model.step-15120.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:46 model.step-15204.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:47 model.step-15288.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:48 model.step-15372.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:49 model.step-15456.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:49 model.step-15540.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:50 model.step-15624.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:51 model.step-15708.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:52 model.step-15792.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:53 model.step-15876.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:54 model.step-15960.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:55 model.step-16044.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:55 model.step-16128.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:56 model.step-16212.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:57 model.step-16296.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:58 model.step-16380.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 14:59 model.step-16464.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 15:00 model.step-16548.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 15:01 model.step-16632.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 15:01 model.step-16716.tlt
4.1M -rwxrwxrwx 1 root root 4.1M Nov 1 15:02 events.out.tfevents.1667312323.c25c5f3429e3
512 -rwxrwxrwx 1 root root 227 Nov 1 15:25 results_200.json
180M -rwxrwxrwx 1 root root 180M Nov 1 15:27 model.step-16800.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 15:28 model.step-16884.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 15:29 model.step-16968.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 15:30 model.step-17052.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 15:31 model.step-17136.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 15:32 model.step-17220.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 15:33 model.step-17304.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 15:33 model.step-17388.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 15:34 model.step-17472.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 15:35 model.step-17556.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 15:36 model.step-17640.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 15:37 model.step-17724.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 15:38 model.step-17808.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 15:39 model.step-17892.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 15:39 model.step-17976.tlt
180M -rwxrwxrwx 1 root root 180M Nov 1 15:40 model.step-18060.tlt
4.0M -rwxrwxrwx 1 root root 4.0M Nov 1 15:41 events.out.tfevents.1667316408.0043beb90cf3
512 -rwxrwxrwx 1 root root 226 Nov 1 15:44 results_250.json
4.0M -rwxrwxrwx 1 root root 4.0M Nov 1 15:46 events.out.tfevents.1667317502.bd890ca6dd6e
512 -rwxrwxrwx 1 root root 186 Nov 1 15:46 monitor.json
512 -rwxrwxrwx 1 root root 227 Nov 1 15:48 results_300.json
4.0K -rwxrwxrwx 1 root root 1.4K Nov 1 15:48 experiment_spec.txt
0 -rwxrwxrwx 1 root root 0 Nov 1 15:48 profile_log.txt
252K -rwxrwxrwx 1 root root 252K Nov 1 15:48 output.log
3.0M -rwxrwxrwx 1 root root 3.0M Nov 1 15:48 graph.pbtxt
4.0M -rwxrwxrwx 1 root root 4.0M Nov 1 15:48 events.out.tfevents.1667317697.863f309e77d7
180M -rwxrwxrwx 1 root root 180M Nov 1 15:48 model.step-18144.tlt
512 -rwxrwxrwx 1 root root 226 Nov 1 15:49 results_tlt.json
512 -rwxrwxrwx 1 root root 226 Nov 1 15:49 results_350.json
512 -rwxrwxrwx 1 root root 42 Nov 3 13:40 target_class_id_mapping.json
0 -rwxrwxrwx 1 root root 0 Nov 3 13:40 log.txt
4.0K drwxrwxrwx 1 root root 4.0K Nov 3 13:40 vis_overlay_tlt
4.0K drwxrwxrwx 1 root root 4.0K Nov 3 13:40 mask_labels_tlt
2022-11-03 22:51:06,302 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

39GB total with 488 GB free (see below)

.tao_mounts.json (338 Bytes)

image

Not sure what I can do. The docker container is supposed to grow as you put stuff in it, which leads me to propose two ideas:

  1. Allow control of the storage location of the tao docker (I am not very experienced with docker), or
  2. Also map the tmp directory from within the docker, so that the temporary directory can be mapped to whichever drive has more space…

Thanks
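The second idea above could be expressed as one more entry in the "Mounts" list of ~/.tao_mounts.json. The Python sketch below builds such a config from the source/destination layout shown in this thread. The scratch directory /mnt/DATA/tao_tmp is hypothetical, and whether mounting a host folder over the container's /tmp works with the TAO launcher, or resolves this error, is an untested assumption that nothing in this thread confirms.

```python
import json


def add_mount(config, source, destination):
    """Return a copy of a mounts config with one extra source/destination pair."""
    mounts = list(config.get("Mounts", []))
    if any(m.get("destination") == destination for m in mounts):
        raise ValueError(f"{destination} is already a mount destination")
    mounts.append({"source": source, "destination": destination})
    updated = dict(config)
    updated["Mounts"] = mounts
    return updated


# The two mounts quoted in this thread.
config = {
    "Mounts": [
        {"source": "/mnt/DATA/MP/6S004C",
         "destination": "/workspace/tao-experiments"},
        {"source": "/mnt/DATA/MP/6S004C/specs",
         "destination": "/workspace/tao-experiments/specs"},
    ]
}

# Hypothetical scratch directory on the large data drive, mapped to the container's /tmp.
updated = add_mount(config, "/mnt/DATA/tao_tmp", "/tmp")
print(json.dumps(updated, indent=4))
```

The printed JSON is what would go into ~/.tao_mounts.json; the host directory has to exist before the container starts.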

From your .tao_mounts.json,

    "Mounts": [
        {
            "source": "/mnt/DATA/MP/6S004C",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/mnt/DATA/MP/6S004C/specs",
            "destination": "/workspace/tao-experiments/specs"
        }
    ],

Did you ever mount another machine's folder onto your local /mnt/DATA folder?
Can you check with $ df -h?

If /mnt/DATA really is on your local machine, please try another directory and retry.

For example,

    "Mounts": [
        {
            "source": "/home/yourname/DATA/MP/6S004C",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/yourname/DATA/MP/6S004C/specs",
            "destination": "/workspace/tao-experiments/specs"
        }
    ],

(TAO) david@AI01:~$ tao unet run /bin/bash
2022-11-04 09:24:35,368 [INFO] root: Registry: ['nvcr.io']
2022-11-04 09:24:35,531 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
groups: cannot find name for group ID 1000
I have no name!@bf45ee3af7fb:/workspace$ df -h
Filesystem Size Used Avail Use% Mounted on
overlay 290G 246G 30G 90% /
tmpfs 64M 0 64M 0% /dev
tmpfs 16G 0 16G 0% /sys/fs/cgroup
shm 64M 0 64M 0% /dev/shm
/dev/sda1 1.9T 1.4T 456G 76% /workspace/tao-experiments
/dev/nvme0n1p5 290G 246G 30G 90% /etc/hosts
tmpfs 16G 12K 16G 1% /proc/driver/nvidia
tmpfs 16G 4.0K 16G 1% /etc/nvidia/nvidia-application-profiles-rc.d
tmpfs 3.2G 4.5M 3.1G 1% /run/nvidia-persistenced/socket
udev 16G 0 16G 0% /dev/nvidia0
tmpfs 16G 0 16G 0% /proc/asound
tmpfs 16G 0 16G 0% /proc/acpi
tmpfs 16G 0 16G 0% /proc/scsi
tmpfs 16G 0 16G 0% /sys/firmware
I have no name!@bf45ee3af7fb:/workspace$

What's your logic behind this? Anything rooted in /home/user is on the OS root drive, and running out of space there can seriously impede the normal operation of the workstation!

In fact, when I started with TAO I had it set up that way and quickly crashed the computer because of it.

And /mnt/DATA has 488 GB free!

I am still having the problem, which means I cannot train overnight because training keeps stopping…

When you run
! tao unet run ls -rltsh $USER_EXPERIMENT_DIR/unpruned

What is $USER_EXPERIMENT_DIR set to?
Can you echo $USER_EXPERIMENT_DIR?

env: USER_EXPERIMENT_DIR=/workspace/tao-experiments

Can you run $ df -h again, this time outside the docker container?

I have no name!@213f351247be:/workspace$ df -h
Filesystem Size Used Avail Use% Mounted on
overlay 290G 246G 30G 90% /
tmpfs 64M 0 64M 0% /dev
tmpfs 16G 0 16G 0% /sys/fs/cgroup
shm 64M 0 64M 0% /dev/shm
/dev/sda1 1.9T 1.4T 456G 76% /workspace/tao-experiments
/dev/nvme0n1p5 290G 246G 30G 90% /etc/hosts
tmpfs 16G 12K 16G 1% /proc/driver/nvidia
tmpfs 16G 4.0K 16G 1% /etc/nvidia/nvidia-application-profiles-rc.d
tmpfs 3.2G 4.5M 3.1G 1% /run/nvidia-persistenced/socket
udev 16G 0 16G 0% /dev/nvidia0
tmpfs 16G 0 16G 0% /proc/asound
tmpfs 16G 0 16G 0% /proc/acpi
tmpfs 16G 0 16G 0% /proc/scsi
tmpfs 16G 0 16G 0% /sys/firmware

It seems to be mounting on the OS boot drive.

If I could mount on another drive, that would be a solution!

No, this is the result from inside the tao docker container.
Please open a new terminal on the host and run $ df -h

$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 16G 0 16G 0% /dev
tmpfs 3.2G 4.6M 3.1G 1% /run
/dev/nvme0n1p5 290G 246G 30G 90% /
tmpfs 16G 221M 16G 2% /dev/shm
tmpfs 5.0M 4.0K 5.0M 1% /run/lock
tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/loop0 128K 128K 0 100% /snap/bare/5
/dev/loop2 64M 64M 0 100% /snap/core20/1623
/dev/loop1 128K 128K 0 100% /snap/acrordrdc/62
/dev/loop3 64M 64M 0 100% /snap/core20/1634
/dev/loop4 144M 144M 0 100% /snap/chromium/2136
/dev/loop7 219M 219M 0 100% /snap/gnome-3-34-1804/66
/dev/loop5 115M 115M 0 100% /snap/core/13741
/dev/loop6 219M 219M 0 100% /snap/gnome-3-34-1804/77
/dev/loop9 228M 228M 0 100% /snap/code/111
/dev/loop8 237M 237M 0 100% /snap/code/112
/dev/loop16 347M 347M 0 100% /snap/gnome-3-38-2004/119
/dev/loop10 56M 56M 0 100% /snap/cups/836
/dev/loop17 115M 115M 0 100% /snap/core/13886
/dev/loop22 347M 347M 0 100% /snap/gnome-3-38-2004/115
/dev/loop11 56M 56M 0 100% /snap/core18/2560
/dev/loop15 56M 56M 0 100% /snap/core18/2566
/dev/loop24 82M 82M 0 100% /snap/gtk-common-themes/1534
/dev/loop23 347M 347M 0 100% /snap/wine-platform-runtime/316
/dev/loop13 165M 165M 0 100% /snap/gnome-3-28-1804/161
/dev/loop12 146M 146M 0 100% /snap/chromium/2168
/dev/loop20 392M 392M 0 100% /snap/gimp/383
/dev/loop21 522M 522M 0 100% /snap/gimp/393
/dev/loop25 55M 55M 0 100% /snap/snap-store/558
/dev/loop26 321M 321M 0 100% /snap/vlc/3078
/dev/loop18 48M 48M 0 100% /snap/snapd/17336
/dev/loop27 87M 87M 0 100% /snap/simplescreenrecorder/1
/dev/loop19 92M 92M 0 100% /snap/gtk-common-themes/1535
/dev/loop14 256K 256K 0 100% /snap/gtk2-common-themes/13
/dev/loop29 112M 112M 0 100% /snap/losslesscut/109
/dev/loop28 348M 348M 0 100% /snap/wine-platform-runtime/315
/dev/loop30 296M 296M 0 100% /snap/vlc/2344
/dev/loop31 48M 48M 0 100% /snap/snapd/17029
/dev/loop34 147M 147M 0 100% /snap/qbittorrent-arnatious/86
/dev/loop33 46M 46M 0 100% /snap/snap-store/599
/dev/loop32 323M 323M 0 100% /snap/wine-platform-6-stable/19
/dev/nvme0n1p2 95M 31M 65M 32% /boot/efi
/dev/sda1 1.9T 1.4T 456G 76% /mnt/DATA
tmpfs 3.2G 104K 3.2G 1% /run/user/1000
/dev/sdc1 239G 165G 75G 69% /media/david/DB01
/dev/sdb2 875G 39G 792G 5% /mnt/sdb2
/dev/nvme0n1p4 594G 275G 319G 47% /mnt/9866542066540204
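With this many snap loop devices in the listing, it can help to filter the df -h output down to real block devices and sort them by available space. The Python sketch below does that for a few rows copied from the output above. The function names are made up for this example, and it assumes the six-column df -h layout shown here, with sizes in powers of 1024.

```python
def parse_size(text):
    """Convert a df -h size such as '456G' or '4.5M' to bytes (powers of 1024)."""
    units = {"K": 1, "M": 2, "G": 3, "T": 4}
    suffix = text[-1].upper()
    if suffix in units:
        return float(text[:-1]) * 1024 ** units[suffix]
    return float(text)  # plain number such as '0'


def real_filesystems(df_output):
    """Return (mount point, available bytes) for /dev devices, most free space first."""
    rows = []
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue
        device, _size, _used, avail, _percent, mount = fields
        # Keep real block devices; drop tmpfs, udev, overlay and snap loop mounts.
        if not device.startswith("/dev/") or device.startswith("/dev/loop"):
            continue
        rows.append((mount, parse_size(avail)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# A few rows copied from the host output in this thread.
SAMPLE = """Filesystem Size Used Avail Use% Mounted on
udev 16G 0 16G 0% /dev
/dev/nvme0n1p5 290G 246G 30G 90% /
/dev/loop0 128K 128K 0 100% /snap/bare/5
/dev/sda1 1.9T 1.4T 456G 76% /mnt/DATA
/dev/sdb2 875G 39G 792G 5% /mnt/sdb2
"""

for mount, avail in real_filesystems(SAMPLE):
    print(f"{mount}: {avail / 1024**3:.0f} GiB available")
```

On these rows the root filesystem "/" comes last, with far less room than /mnt/sdb2 or /mnt/DATA.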

Can you try changing the mounts as below?

    "Mounts": [
        {
            "source": "/mnt/sdb2/MP/6S004C",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/mnt/sdb2/MP/6S004C/specs",
            "destination": "/workspace/tao-experiments/specs"
        }
    ],

I have edited my previous comment as shown above. Please try /mnt/sdb2.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one.
Thanks

Also, the culprit for the error is described below.
Please check the docker storage location:
$ sudo docker info | grep "Docker Root Dir"

If it is /var/lib/docker, the container's own filesystem (including its /tmp) lives under "/", where only 30G is free, so there is not enough space to save the unet .tlt model.

So, please change the docker root dir.
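The thread does not say how to change the root dir. One common approach is to set the "data-root" key in the Docker daemon configuration file, normally /etc/docker/daemon.json, and then restart the Docker daemon. The Python sketch below only builds the new file contents. The existing "runtimes" entry is an assumed example of what an NVIDIA container setup might already contain, and /mnt/sdb2/docker is a hypothetical target directory on a drive with more free space.

```python
import json


def with_data_root(daemon_config, new_root):
    """Return a copy of a daemon.json dict with "data-root" set, keeping other keys."""
    updated = dict(daemon_config)
    updated["data-root"] = new_root
    return updated


# Assumed existing contents of /etc/docker/daemon.json; yours may differ or be empty.
existing = {
    "runtimes": {
        "nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}
    }
}

# Hypothetical new location on a drive with plenty of free space.
updated = with_data_root(existing, "/mnt/sdb2/docker")

# Write this output to /etc/docker/daemon.json as root, then restart the Docker daemon.
# Images already pulled stay in the old directory unless they are copied or pulled again.
print(json.dumps(updated, indent=4))
```

After the restart, the docker info command above should report the new directory as "Docker Root Dir".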
