TAO Toolkit 5.2 Directory Not Empty

IainA · March 26, 2024, 9:25pm

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) A6000
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) Segformer
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

Hi @Morganh

I did not manage to return to this matter until after it had been closed.

I confirm that I had tried the isbi example before and did not get the error. Every other dataset hits the issue, though.

I had tried the -r option with a local directory by launching the specific container and mapping volumes etc. The same error occurs. I have also tried running with the launcher and a full local path for the -r option, and everything runs OK, but of course the results are not written because the -r option expects a container path.
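To illustrate why a host path passed to -r does not land where expected, here is a minimal sketch of how a host path maps onto a container path through a mount table. It assumes the "source"/"destination" entry shape used in the launcher's mounts file; the host directory and the helper name host_to_container are hypothetical, and this is not the launcher's own code.

```python
import os

# Hypothetical mount table; the host path is made up for illustration.
MOUNTS = [
    {
        "source": "/home/user/tao-experiments",
        "destination": "/workspace/tao-experiments",
    },
]


def host_to_container(host_path, mounts=MOUNTS):
    """Return the path the container would see, or None if nothing maps it."""
    host_path = os.path.normpath(host_path)
    for mount in mounts:
        src = os.path.normpath(mount["source"])
        if host_path == src or host_path.startswith(src + os.sep):
            rel = os.path.relpath(host_path, src)
            return os.path.normpath(os.path.join(mount["destination"], rel))
    return None


print(host_to_container("/home/user/tao-experiments/results/Ex4"))
print(host_to_container("/tmp/not-mounted"))
```

A path that maps to None exists only on the host, so anything the container writes to that literal path stays inside the container's own filesystem.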

I normally use the launcher and don’t interact with the containers directly. But I have shown that this does not matter. I’ve tried running with 1 gpu - no difference.

I have tried this on many different datasets and all fail with exactly the same issue. There must be others who are experiencing this. I’ve also tried running with root privileges in the container (through .tao_mounts).

The error logs point to train.py on the container’s filesystem under dist-packages, so I am unable to review that code to give further details on what might be causing it.

I am concerned that if, as you say, validation is not run then the training is not being guided by the VAL dataset and you can’t see how the model is doing during training because the model output in status.json is for all categories including background.

Please help - thank you.

Morganh · March 27, 2024, 2:24am

Thanks @IainA for the detailed info. From your experiments, it seems that the error is related to the dataset. Among the different datasets you have tried, is there a public one? I can also try to reproduce. If there is no public one, could you please share a small part of your dataset?

Since the isbi dataset works, you can compare against it and check whether anything differs in your own dataset: folder structure, label files, additional unexpected files, etc.

IainA · April 3, 2024, 8:19pm

Hi @Morganh

My datasets are the same as the ones converted from the example tiff files. They are 8 bit grayscale and I’ve tried multiple different datasets.

I’ve pulled the thread on the stack trace, using the TAO PyTorch backend on GitHub. The offending line is:

  File "/usr/lib/python3.8/shutil.py", line 720, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/workspace/tao-experiments/results/Ex4/train/.eval_hook'

I traced that to the shutil.py from Python 3.8 and the lines in question are:

    try:
        if os.path.samestat(orig_st, os.fstat(fd)):
            _rmtree_safe_fd(fd, path, onerror)
            try:
                os.rmdir(path)
            except OSError:
                onerror(os.rmdir, path, sys.exc_info())

Line 720 is

os.rmdir(path)
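For reference, a minimal sketch of what that os.rmdir call does when anything is still inside the directory. The directory and file names only mirror the ones in this thread; the helper name is hypothetical. On Linux, errno.ENOTEMPTY is the 39 shown in the traceback.

```python
import errno
import os
import tempfile


def try_rmdir_nonempty():
    """Return the errno raised by os.rmdir on a directory that still holds a file."""
    with tempfile.TemporaryDirectory() as base:
        target = os.path.join(base, ".eval_hook")
        os.mkdir(target)
        with open(os.path.join(target, "part_0.pkl"), "wb") as f:
            f.write(b"placeholder")
        try:
            os.rmdir(target)  # only succeeds on an empty directory
        except OSError as exc:
            return exc.errno
    return None


print(try_rmdir_nonempty() == errno.ENOTEMPTY)
```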

OK, where does the path come from? Check line 202, in collect_results_cpu, in cv.segformer.inference.inferencer:

def collect_results_cpu(result_part, size, tmpdir=None):
    """Collect results with CPU."""
    rank, world_size = get_dist_info()
    # create a tmp dir if it is not specified
    if tmpdir is None:
        MAX_LEN = 512
        # 32 is whitespace
        dir_tensor = torch.full((MAX_LEN, ),
                                32,
                                dtype=torch.uint8,
                                device='cuda')
        if rank == 0:
            tmpdir = tempfile.mkdtemp()
            tmpdir = torch.tensor(
                bytearray(tmpdir.encode()), dtype=torch.uint8, device='cuda')
            dir_tensor[:len(tmpdir)] = tmpdir
        dist.broadcast(dir_tensor, 0)
        tmpdir = dir_tensor.cpu().numpy().tobytes().decode().rstrip()
    else:
        mmcv.mkdir_or_exist(tmpdir)
    # dump the part result to the dir
    mmcv.dump(result_part, osp.join(tmpdir, 'part_{}.pkl'.format(rank)))
    dist.barrier()
    # collect all parts
    if rank != 0:
        return None
    # load results of all parts from tmp dir
    part_list = []
    for i in range(world_size):
        part_file = osp.join(tmpdir, 'part_{}.pkl'.format(i))  # <-- see later on
        part_list.append(mmcv.load(part_file))
    # sort the results
    ordered_results = []
    for res in zip(*part_list):
        ordered_results.extend(list(res))
    # the dataloader may pad some samples
    ordered_results = ordered_results[:size]
    # remove tmp dir
    shutil.rmtree(tmpdir)
    return ordered_results
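The dump, reload, interleave, truncate and rmtree pattern in collect_results_cpu can be imitated in a single process. The sketch below is an assumption-laden stand-in, not the TAO code: it uses pickle and tempfile instead of mmcv and torch.distributed, and the helper name collect_parts is hypothetical.

```python
import os
import pickle
import shutil
import tempfile


def collect_parts(parts, size):
    """Write one part file per simulated rank, then merge them as rank 0 would."""
    tmpdir = tempfile.mkdtemp()
    # each "rank" dumps its partial results
    for rank, part in enumerate(parts):
        with open(os.path.join(tmpdir, "part_{}.pkl".format(rank)), "wb") as f:
            pickle.dump(part, f)
    # rank 0 loads every part back
    part_list = []
    for rank in range(len(parts)):
        with open(os.path.join(tmpdir, "part_{}.pkl".format(rank)), "rb") as f:
            part_list.append(pickle.load(f))
    # interleave so samples return to dataset order, then drop padded samples
    ordered = []
    for res in zip(*part_list):
        ordered.extend(res)
    ordered = ordered[:size]
    # remove the temp dir; every file handle above is already closed
    shutil.rmtree(tmpdir)
    return ordered, os.path.exists(tmpdir)


print(collect_parts([[0, 2, 4], [1, 3, 5]], size=5))
```

On a local filesystem the directory is gone by the time the function returns, which matches the behaviour reported later in this thread for the local workstation.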

The call to shutil.rmtree is on the second-to-last line of collect_results_cpu. Please also note that I’ve marked a line in that function that I’ll get to later. To get the tmp directory you go to multi_gpu_test in the same file:

def multi_gpu_test(model,
                   data_loader,
                   tmpdir=None,
                   gpu_collect=False,
                   efficient_test=False):
    """Test model with multiple gpus.

    This method tests model with multiple gpus and collects the results
    under two different modes: gpu and cpu modes. By setting 'gpu_collect=True'
    it encodes results to gpu tensors and use gpu communication for results
    collection. On cpu mode it saves the results on different gpus to 'tmpdir'
    and collects them by the rank 0 worker.

    Args:
        model (nn.Module): Model to be tested.
        data_loader (utils.data.Dataloader): Pytorch data loader.
        tmpdir (str): Path of directory to save the temporary results from
            different gpus under cpu mode.
        gpu_collect (bool): Option to use either gpu or cpu to collect results.
        efficient_test (bool): Whether save the results as local numpy files to
            save CPU memory during evaluation. Default: False.

    Returns:
        list: The prediction results.
    """
    model.eval()
    results = []
    dataset = data_loader.dataset
    rank, world_size = get_dist_info()
    if rank == 0:
        prog_bar = mmcv.ProgressBar(len(dataset))
    for _, data in enumerate(data_loader):

        with torch.no_grad():
            result = model(return_loss=False, rescale=True, **data)

        if isinstance(result, list):
            if efficient_test:
                result = [np2tmp(_) for _ in result]
            results.extend(result)
        else:
            if efficient_test:
                result = np2tmp(result)
            results.append(result)

        if rank == 0:
            batch_size = data['img'][0].size(0)
            for _ in range(batch_size * world_size):
                prog_bar.update()

    # collect results from all ranks
    if gpu_collect:
        results = collect_results_gpu(results, len(dataset))
    else:
        results = collect_results_cpu(results, len(dataset), tmpdir)
    return results

The tmp directory is passed in from

def after_train_iter(self, runner):
    """After train epoch hook."""
    if self.by_epoch or not self.every_n_iters(runner, self.interval):
        return
    from nvidia_tao_pytorch.cv.segformer.inference.inferencer import multi_gpu_test
    runner.log_buffer.clear()
    results = multi_gpu_test(
        runner.model,
        self.dataloader,
        tmpdir=osp.join(runner.work_dir, '.eval_hook'),  # <-- here's the temp directory .eval_hook
        gpu_collect=self.gpu_collect)
    if runner.rank == 0:
        print('\n')
        self.evaluate(runner, results)

That function is in eval_hooks.py under cv.segformer.core.evaluation. I’ve also marked above the line where the pkl file holding the results is read from the temp directory. I saw this file being created when the evaluation hook fires and then disappear, but the rmtree call above must be running before the file is actually removed. I had thought, though, that rmtree removes everything even if a directory is not empty?
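On the rmtree question: shutil.rmtree deletes a directory's contents first and only then removes the directory itself, so on an ordinary local filesystem a non-empty directory is not a problem. A minimal sketch, with a hypothetical helper name and file names that only mirror this thread:

```python
import os
import shutil
import tempfile


def rmtree_nonempty_demo():
    """Build a small non-empty tree, remove it, and report whether it still exists."""
    root = tempfile.mkdtemp()
    nested = os.path.join(root, "sub")
    os.mkdir(nested)
    for name in (os.path.join(root, "part_0.pkl"),
                 os.path.join(nested, "part_1.pkl")):
        with open(name, "wb") as f:
            f.write(b"x")
    shutil.rmtree(root)  # removes files and subdirectories, then root itself
    return os.path.exists(root)


print(rmtree_nonempty_demo())
```

Seen this way, the Errno 39 in the traceback suggests that something was still present in .eval_hook, or reappeared there, at the moment of the final os.rmdir; the sketch does not show what that something was.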

Long story short - I don’t think this is a dataset issue. Have you been able to check internally? BTW - this is TAO 5.2. I have another question on that in a separate post.

Finally, one more reason I believe the dataset is OK: I get good results even though validation does not run as part of the training.

Thank you.

Cheers

Morganh · April 4, 2024, 5:39pm

Thanks a lot for the hint and detailed info. I will check if it is a bug here. Previously I did not reproduce with 1gpu in TAO Toolkit 5.2 (5.2.0.1-pyt1.14.0:Segformer) - OSError: [Errno 39] Directory not empty: '/results/train/.eval_hook' - #10 by Morganh.

IainA · April 4, 2024, 7:43pm

Thanks @Morganh

BTW I tried with 1 gpu and got the same error. It’s very strange as I see the pkl file generated, then disappear (exactly as per the source code) but I get that error. To me it’s like another process has that file handle, so that rmtree fails.
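One way to test the file-handle theory is to look at what is left in the directory when rmtree fails. The sketch below is a hypothetical diagnostic, not part of TAO; the remover parameter exists only so the failure path can be exercised locally. On NFS-style network mounts, a file deleted while it is still open can linger as a hidden placeholder (often named .nfs followed by an ID), but whether the storage in this thread behaves that way is not established.

```python
import os
import shutil
import tempfile


def rmtree_and_report(path, remover=shutil.rmtree):
    """Try to remove path; if that fails, return the names still inside it."""
    try:
        remover(path)
    except OSError:
        try:
            # include hidden entries; os.listdir does not filter dotfiles
            return sorted(os.listdir(path))
        except FileNotFoundError:
            return []
    return []


def always_fails(path):
    """Stand-in remover that imitates the error from the traceback."""
    raise OSError(39, "Directory not empty", path)


demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, ".nfs0001"), "w").close()  # simulated leftover
print(rmtree_and_report(demo_dir, remover=always_fails))
print(rmtree_and_report(demo_dir))
```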

cheers

Morganh · April 10, 2024, 7:19am

I cannot reproduce the error with 1gpu. See the log shared in another topic.
Is it possible to share with me a small part of a dataset that reproduces the error?

IainA · April 10, 2024, 11:50am

Hi @Morganh - I will share some dataset inputs separately.

Cheers

IainA · April 12, 2024, 6:00pm

Hi @Morganh

I believe I have resolved (or can repeat) the issue. I have been running on a cloud GPU instance (dual RTX A6000) that has the data volume on a separate cloud storage (Lambda Labs in this case - both GPU instance and cloud storage that are mounted as a pair). In this set up, it always fails.

I have a local machine that is a dual RTX A6000 workstation. When I run it locally there are no issues. So network-attached volumes may have this issue.
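If the cause is a network volume that lags behind on deletions, one possible workaround is to retry the removal a few times. This is only a sketch under that assumption: rmtree_with_retry is a hypothetical helper, it is not part of TAO, applying it would mean patching the code inside the container, and it has not been tested against the storage described here.

```python
import os
import shutil
import tempfile
import time


def rmtree_with_retry(path, attempts=5, delay=0.5,
                      remover=shutil.rmtree, sleep=time.sleep):
    """Remove path, retrying on OSError; return the attempt that succeeded."""
    for attempt in range(1, attempts + 1):
        try:
            remover(path)
            return attempt
        except FileNotFoundError:
            return attempt  # already gone, nothing left to do
        except OSError:
            if attempt == attempts:
                raise  # give up and surface the original error
            sleep(delay)


demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "part_0.pkl"), "wb") as f:
    f.write(b"x")
print(rmtree_with_retry(demo_dir, delay=0.0))
```

Setting gpu_collect=True avoids the temp directory entirely according to the docstring quoted earlier in this thread, though whether the TAO spec file exposes that option is not confirmed here.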

I’m happy to close unless you want to have me run further tests.

Thank you for your help.

Cheers

Morganh · April 14, 2024, 5:23am

Thanks for sharing! Glad to know it works now.

system · April 28, 2024, 5:24am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
