TAO Toolkit 5.2 Directory Not Empty

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) A6000
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) Segformer
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

Hi @Morganh

I did not manage to return to this matter until after it had been closed.

I confirm that I had tried the ISBI example before and did not get the error. Every other dataset is an issue, though.

I tried the -r option with a local directory by launching the specific container and mapping volumes etc. The same error occurs. I have also tried running with the launcher and a full local path for the -r option; everything runs OK, but of course the results are not written because the -r option expects a container path.

I normally use the launcher and don’t interact with the containers directly. But I have shown that this does not matter. I’ve tried running with 1 gpu - no difference.

I have tried this on many different datasets and all fail with exactly the same issue. There must be others who are experiencing this. I’ve also tried running with root privileges in the container (through .tao_mounts).

The error logs point to train.py in the container filesystem under dist-packages, so I am unable to review the code to give further details on what might be causing it.

I am concerned that if, as you say, validation is not run then the training is not being guided by the VAL dataset and you can’t see how the model is doing during training because the model output in status.json is for all categories including background.

Please help - thank you.

Thanks @IainA for the detailed info. From your experiments, it seems that the error is related to the dataset. Among the different datasets you have tried, is there a public one? I can also try to reproduce. If there is no public one, could you please share a small part of a dataset?

Since the ISBI dataset works, you can compare against it and check whether anything is different in your own dataset: folder structure, label files, additional unexpected files, etc.

Hi @Morganh

My datasets are the same as the ones converted from the example tiff files. They are 8 bit grayscale and I’ve tried multiple different datasets.

I’ve pulled the thread on the stack trace using the TAO PyTorch backend on GitHub. The offending line is:

  File "/usr/lib/python3.8/shutil.py", line 720, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/workspace/tao-experiments/results/Ex4/train/.eval_hook'

I traced that to the shutil.py from Python 3.8 and the lines in question are:

    try:
        if os.path.samestat(orig_st, os.fstat(fd)):
            _rmtree_safe_fd(fd, path, onerror)
            try:
                os.rmdir(path)
            except OSError:
                onerror(os.rmdir, path, sys.exc_info())

Line 720 is

os.rmdir(path)

OK, where does the path come from? Check line 202 in collect_results_cpu in cv.segformer.inference.inferencer:

def collect_results_cpu(result_part, size, tmpdir=None):
    """Collect results with CPU."""
    rank, world_size = get_dist_info()
    # create a tmp dir if it is not specified
    if tmpdir is None:
        MAX_LEN = 512
        # 32 is whitespace
        dir_tensor = torch.full((MAX_LEN, ),
                                32,
                                dtype=torch.uint8,
                                device='cuda')
        if rank == 0:
            tmpdir = tempfile.mkdtemp()
            tmpdir = torch.tensor(
                bytearray(tmpdir.encode()), dtype=torch.uint8, device='cuda')
            dir_tensor[:len(tmpdir)] = tmpdir
        dist.broadcast(dir_tensor, 0)
        tmpdir = dir_tensor.cpu().numpy().tobytes().decode().rstrip()
    else:
        mmcv.mkdir_or_exist(tmpdir)
    # dump the part result to the dir
    mmcv.dump(result_part, osp.join(tmpdir, 'part_{}.pkl'.format(rank)))
    dist.barrier()
    # collect all parts
    if rank != 0:
        return None
    # load results of all parts from tmp dir
    part_list = []
    for i in range(world_size):
        part_file = osp.join(tmpdir, 'part_{}.pkl'.format(i))  # ******** See later on
        part_list.append(mmcv.load(part_file))
    # sort the results
    ordered_results = []
    for res in zip(*part_list):
        ordered_results.extend(list(res))
    # the dataloader may pad some samples
    ordered_results = ordered_results[:size]
    # remove tmp dir
    shutil.rmtree(tmpdir)
    return ordered_results
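For readers without the TAO container to hand, the dump / gather / clean-up pattern in the function above can be sketched with the standard library only. This is a simplified illustration, not the TAO or mmcv implementation: `dump_part` and `collect_parts` are hypothetical names, both "ranks" are simulated in a single process, and `pickle` stands in for `mmcv.dump` / `mmcv.load`.

```python
import os
import pickle
import shutil
import tempfile


def dump_part(tmpdir, rank, result_part):
    """Each rank writes its partial results to part_<rank>.pkl."""
    os.makedirs(tmpdir, exist_ok=True)
    with open(os.path.join(tmpdir, f"part_{rank}.pkl"), "wb") as f:
        pickle.dump(result_part, f)


def collect_parts(tmpdir, world_size, size):
    """Rank 0 loads every part, interleaves them, trims padding, cleans up."""
    part_list = []
    for i in range(world_size):
        with open(os.path.join(tmpdir, f"part_{i}.pkl"), "rb") as f:
            part_list.append(pickle.load(f))
    ordered = []
    for res in zip(*part_list):
        ordered.extend(res)
    ordered = ordered[:size]  # drop padded samples
    # This is the step that raised Errno 39 in this thread.
    shutil.rmtree(tmpdir)
    return ordered


# Simulate two ranks in one process; the last entry of rank 1 is a padded duplicate.
work_dir = tempfile.mkdtemp()
tmpdir = os.path.join(work_dir, ".eval_hook")
dump_part(tmpdir, 0, ["s0", "s2", "s4"])
dump_part(tmpdir, 1, ["s1", "s3", "s0"])
results = collect_parts(tmpdir, world_size=2, size=5)
tmpdir_removed = not os.path.exists(tmpdir)
print(results, tmpdir_removed)
shutil.rmtree(work_dir)
```

On a local disk the temporary directory is gone once `collect_parts` returns, which matches the behaviour expected from the original function.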

There’s the call to shutil.rmtree on the second-last line of collect_results_cpu. Please also note the line I’ve marked in that function, which I’ll come back to later. To get the tmp directory you go to multi_gpu_test in the same file:

def multi_gpu_test(model,
                   data_loader,
                   tmpdir=None,
                   gpu_collect=False,
                   efficient_test=False):
    """Test model with multiple gpus.

    This method tests model with multiple gpus and collects the results
    under two different modes: gpu and cpu modes. By setting 'gpu_collect=True'
    it encodes results to gpu tensors and use gpu communication for results
    collection. On cpu mode it saves the results on different gpus to 'tmpdir'
    and collects them by the rank 0 worker.

    Args:
        model (nn.Module): Model to be tested.
        data_loader (utils.data.Dataloader): Pytorch data loader.
        tmpdir (str): Path of directory to save the temporary results from
            different gpus under cpu mode.
        gpu_collect (bool): Option to use either gpu or cpu to collect results.
        efficient_test (bool): Whether save the results as local numpy files to
            save CPU memory during evaluation. Default: False.

    Returns:
        list: The prediction results.
    """
    model.eval()
    results = []
    dataset = data_loader.dataset
    rank, world_size = get_dist_info()
    if rank == 0:
        prog_bar = mmcv.ProgressBar(len(dataset))
    for _, data in enumerate(data_loader):

        with torch.no_grad():
            result = model(return_loss=False, rescale=True, **data)

        if isinstance(result, list):
            if efficient_test:
                result = [np2tmp(_) for _ in result]
            results.extend(result)
        else:
            if efficient_test:
                result = np2tmp(result)
            results.append(result)

        if rank == 0:
            batch_size = data['img'][0].size(0)
            for _ in range(batch_size * world_size):
                prog_bar.update()

    # collect results from all ranks
    if gpu_collect:
        results = collect_results_gpu(results, len(dataset))
    else:
        results = collect_results_cpu(results, len(dataset), tmpdir)
    return results

The tmp directory is passed in from

def after_train_iter(self, runner):
    """After train epoch hook."""
    if self.by_epoch or not self.every_n_iters(runner, self.interval):
        return
    from nvidia_tao_pytorch.cv.segformer.inference.inferencer import multi_gpu_test
    runner.log_buffer.clear()
    results = multi_gpu_test(
        runner.model,
        self.dataloader,
        tmpdir=osp.join(runner.work_dir, '.eval_hook'),  # ********* Here's the temp directory .eval_hook
        gpu_collect=self.gpu_collect)
    if runner.rank == 0:
        print('\n')
        self.evaluate(runner, results)

That function is in eval_hooks.py under cv.segformer.core.evaluation. I’ve also marked above where the pkl file that contains the results in the temp directory is referenced. I saw this file created when the evaluation hook fires, and then it disappears, but the rmtree call above must be run before this file is removed. I had thought, though, that rmtree removes everything even if a directory is not empty?
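On that last question, a small standalone check (standard library only, paths created under a throwaway temp directory) suggests the answer is yes: `shutil.rmtree` deletes the contents first and only then calls `os.rmdir` on the directory itself, while a bare `os.rmdir` on a non-empty directory fails with `ENOTEMPTY` (errno 39 on Linux). So the traceback implies that something was still inside `.eval_hook` at the moment of that final `os.rmdir`.

```python
import errno
import os
import shutil
import tempfile

root = tempfile.mkdtemp()
target = os.path.join(root, ".eval_hook")
os.makedirs(target)
with open(os.path.join(target, "part_0.pkl"), "wb") as f:
    f.write(b"placeholder")

# os.rmdir only removes empty directories.
try:
    os.rmdir(target)
    rmdir_errno = None
except OSError as exc:
    rmdir_errno = exc.errno

# shutil.rmtree removes the children first, then the directory itself.
shutil.rmtree(target)
removed = not os.path.exists(target)

print(rmdir_errno == errno.ENOTEMPTY, removed)
shutil.rmtree(root)
```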

Long story short - I don’t think this is a dataset issue. Have you been able to check internally? BTW - this is TAO 5.2. I have another question on that in a separate post.

Finally, also believing the dataset is ok, I get good results even though the validation does not run as part of the training.

Thank you.

Cheers

Thanks a lot for the hint and detailed info. I will check whether it is a bug. Previously I could not reproduce it with 1 GPU; see TAO Toolkit 5.2 (5.2.0.1-pyt1.14.0:Segformer) - OSError: [Errno 39] Directory not empty: '/results/train/.eval_hook' - #10 by Morganh.

Thanks @Morganh

BTW I tried with 1 gpu and got the same error. It’s very strange as I see the pkl file generated, then disappear (exactly as per the source code) but I get that error. To me it’s like another process has that file handle, so that rmtree fails.
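One way to test that theory would be to look at what is actually left in the directory when the removal fails. The helper below is a hypothetical diagnostic sketch, not part of TAO: `rmtree_report` and its `remover` parameter are names made up here, and `fake_remover` only simulates the failure so the example runs anywhere.

```python
import errno
import os
import shutil
import tempfile


def rmtree_report(path, remover=shutil.rmtree):
    """Try to remove `path`; on OSError return (errno, leftover entry names)."""
    try:
        remover(path)
    except OSError as exc:
        leftovers = sorted(os.listdir(path)) if os.path.isdir(path) else []
        return exc.errno, leftovers
    return None, []


def fake_remover(path):
    """Hypothetical stand-in: deletes the pkl, but another entry remains."""
    os.remove(os.path.join(path, "part_0.pkl"))
    raise OSError(errno.ENOTEMPTY, "Directory not empty", path)


demo = tempfile.mkdtemp()
for name in ("part_0.pkl", "held_open.tmp"):
    open(os.path.join(demo, name), "wb").close()

err, leftovers = rmtree_report(demo, remover=fake_remover)
print(err == errno.ENOTEMPTY, leftovers)

# With the real shutil.rmtree on a local disk nothing is left behind.
ok_err, ok_leftovers = rmtree_report(demo)
```

If the leftover names on the failing system turn out to be files that were never written by the training code, that would point at the filesystem rather than at the dataset.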

cheers

I cannot reproduce the error with 1 GPU. See the log shared in the other topic.
Is it possible to share a small part of a dataset with which you can reproduce it?

Hi @Morganh - I will share some dataset inputs separately.

Cheers

Hi @Morganh

I believe I have resolved (or at least can repeat) the issue. I have been running on a cloud GPU instance (dual RTX A6000) that has the data volume on separate cloud storage (Lambda Labs in this case, where the GPU instance and the cloud storage are mounted as a pair). In this setup, it always fails.

I also have a local machine, a dual RTX A6000 workstation, and when I run it locally there are no issues. So the issue may be specific to volumes attached in this way.
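If the failure on the attached volume is transient (an assumption; nothing in this thread establishes the cause), one possible mitigation is to retry the removal a few times before giving up. The sketch below is hypothetical and is not a TAO patch: `rmtree_with_retry`, its parameters, and `flaky_remover` (which only simulates two failures) are all made up for illustration.

```python
import errno
import os
import shutil
import tempfile
import time


def rmtree_with_retry(path, attempts=5, delay=0.5, remover=shutil.rmtree):
    """Retry removal; return the attempt that succeeded, re-raise the last OSError."""
    for attempt in range(1, attempts + 1):
        try:
            remover(path)
            return attempt
        except FileNotFoundError:
            return attempt  # already gone, nothing left to do
        except OSError:
            if attempt == attempts:
                raise
            time.sleep(delay)


calls = {"n": 0}


def flaky_remover(path):
    """Hypothetical remover that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError(errno.ENOTEMPTY, "Directory not empty", path)
    shutil.rmtree(path)


demo = tempfile.mkdtemp()
open(os.path.join(demo, "part_0.pkl"), "wb").close()
used = rmtree_with_retry(demo, attempts=5, delay=0.01, remover=flaky_remover)
print(used, os.path.exists(demo))
```

The simpler route, consistent with what worked here, is to keep the results directory on a local disk.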

I’m happy to close unless you want to have me run further tests.

Thank you for your help.

Cheers

Thanks for sharing! Glad to know it works now.