Hi @Morganh
My datasets are in the same format as the ones converted from the example TIFF files. They are 8-bit grayscale, and I’ve tried multiple different datasets.
I’ve pulled the thread on the stack trace using the TAO PyTorch backend source on GitHub. The offending line is:
```
File "/usr/lib/python3.8/shutil.py", line 720, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/workspace/tao-experiments/results/Ex4/train/.eval_hook'
```
I traced that to shutil.py in Python 3.8, and the lines in question are:
```python
try:
    if os.path.samestat(orig_st, os.fstat(fd)):
        _rmtree_safe_fd(fd, path, onerror)
        try:
            os.rmdir(path)
        except OSError:
            onerror(os.rmdir, path, sys.exc_info())
```
Line 720 is the `os.rmdir(path)` call.
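To see why that exact line can throw Errno 39: `os.rmdir` only ever removes an *empty* directory, so anything still (or newly) inside it makes the call fail. A minimal sketch (the file name is just an illustration):

```python
import os
import tempfile

# Minimal sketch: os.rmdir refuses to delete a non-empty directory.
d = tempfile.mkdtemp()
open(os.path.join(d, "part_0.pkl"), "w").close()  # stand-in for a leftover file
os.rmdir(d)  # raises OSError: [Errno 39] Directory not empty
```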
OK, where does the path come from? Check line 202 in `collect_results_cpu` in `cv.segformer.inference.inferencer`:
```python
def collect_results_cpu(result_part, size, tmpdir=None):
    """Collect results with CPU."""
    rank, world_size = get_dist_info()
    # create a tmp dir if it is not specified
    if tmpdir is None:
        MAX_LEN = 512
        # 32 is whitespace
        dir_tensor = torch.full((MAX_LEN, ),
                                32,
                                dtype=torch.uint8,
                                device='cuda')
        if rank == 0:
            tmpdir = tempfile.mkdtemp()
            tmpdir = torch.tensor(
                bytearray(tmpdir.encode()), dtype=torch.uint8, device='cuda')
            dir_tensor[:len(tmpdir)] = tmpdir
        dist.broadcast(dir_tensor, 0)
        tmpdir = dir_tensor.cpu().numpy().tobytes().decode().rstrip()
    else:
        mmcv.mkdir_or_exist(tmpdir)
    # dump the part result to the dir
    mmcv.dump(result_part, osp.join(tmpdir, 'part_{}.pkl'.format(rank)))
    dist.barrier()
    # collect all parts
    if rank != 0:
        return None
    # load results of all parts from tmp dir
    part_list = []
    for i in range(world_size):
        part_file = osp.join(tmpdir, 'part_{}.pkl'.format(i))  # ******** See later on
        part_list.append(mmcv.load(part_file))
    # sort the results
    ordered_results = []
    for res in zip(*part_list):
        ordered_results.extend(list(res))
    # the dataloader may pad some samples
    ordered_results = ordered_results[:size]
    # remove tmp dir
    shutil.rmtree(tmpdir)
    return ordered_results
```
There’s the call to `shutil.rmtree` on the second-to-last line. Please also note the line I’ve highlighted above, which I’ll come back to later. To see where the tmp directory comes from, go to `multi_gpu_test` in the same file:
```python
def multi_gpu_test(model,
                   data_loader,
                   tmpdir=None,
                   gpu_collect=False,
                   efficient_test=False):
    """Test model with multiple gpus.

    This method tests model with multiple gpus and collects the results
    under two different modes: gpu and cpu modes. By setting 'gpu_collect=True'
    it encodes results to gpu tensors and use gpu communication for results
    collection. On cpu mode it saves the results on different gpus to 'tmpdir'
    and collects them by the rank 0 worker.

    Args:
        model (nn.Module): Model to be tested.
        data_loader (utils.data.Dataloader): Pytorch data loader.
        tmpdir (str): Path of directory to save the temporary results from
            different gpus under cpu mode.
        gpu_collect (bool): Option to use either gpu or cpu to collect results.
        efficient_test (bool): Whether save the results as local numpy files to
            save CPU memory during evaluation. Default: False.

    Returns:
        list: The prediction results.
    """
    model.eval()
    results = []
    dataset = data_loader.dataset
    rank, world_size = get_dist_info()
    if rank == 0:
        prog_bar = mmcv.ProgressBar(len(dataset))
    for _, data in enumerate(data_loader):
        with torch.no_grad():
            result = model(return_loss=False, rescale=True, **data)
        if isinstance(result, list):
            if efficient_test:
                result = [np2tmp(_) for _ in result]
            results.extend(result)
        else:
            if efficient_test:
                result = np2tmp(result)
            results.append(result)
        if rank == 0:
            batch_size = data['img'][0].size(0)
            for _ in range(batch_size * world_size):
                prog_bar.update()
    # collect results from all ranks
    if gpu_collect:
        results = collect_results_gpu(results, len(dataset))
    else:
        results = collect_results_cpu(results, len(dataset), tmpdir)
    return results
```
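As an aside on the code above: when `gpu_collect=True`, the results go through `collect_results_gpu`, which never touches a tmpdir, so the `rmtree` call (and this error) cannot happen. A hypothetical way to exercise that path, assuming the hook’s `gpu_collect` flag can actually be enabled from the TAO config (I haven’t verified that):

```python
# Hypothetical workaround sketch: GPU collection bypasses collect_results_cpu,
# so no tmp directory is ever created or removed.
results = multi_gpu_test(model, data_loader, gpu_collect=True)
```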
The tmp directory is passed in from `after_train_iter`:
```python
def after_train_iter(self, runner):
    """After train epoch hook."""
    if self.by_epoch or not self.every_n_iters(runner, self.interval):
        return
    from nvidia_tao_pytorch.cv.segformer.inference.inferencer import multi_gpu_test
    runner.log_buffer.clear()
    results = multi_gpu_test(
        runner.model,
        self.dataloader,
        tmpdir=osp.join(runner.work_dir, '.eval_hook'),  # ********* Here's the temp directory .eval_hook
        gpu_collect=self.gpu_collect)
    if runner.rank == 0:
        print('\n')
        self.evaluate(runner, results)
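```

One thing that jumps out at me here: unlike the `tempfile.mkdtemp()` fallback inside `collect_results_cpu`, the hook reuses the same fixed `.eval_hook` path on every evaluation interval. A hypothetical variation (not the shipped TAO code; `runner.iter` is the runner’s current iteration counter) that would give each evaluation its own directory:

```python
# Hypothetical variation: a per-iteration tmp dir, so a slow cleanup from
# one evaluation can't collide with files written for the next one.
tmpdir = osp.join(runner.work_dir, '.eval_hook_iter{}'.format(runner.iter))
```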
That `after_train_iter` hook is in `cv.segformer.core.evaluation`, in eval_hooks.py. I’ve also highlighted above where the .pkl file containing the results is written into the temp directory. I saw that file get created when the evaluation hook fires and then disappear, but the rmtree command above must be running before this file is removed. I had thought, though, that rmtree removes everything even if a directory is not empty?
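Answering my own question as far as I can tell: `shutil.rmtree` does remove non-empty directories, but it is not atomic. It first unlinks the children it found, then calls `os.rmdir` on the (supposedly empty) directory, so if another process, say another rank dumping its `part_{rank}.pkl` into the same fixed `.eval_hook` path, creates a file in between, that final `os.rmdir` fails exactly like my traceback. There is a `dist.barrier()` before rank 0 reads the parts, but nothing synchronizes the ranks after the `rmtree`. If that is indeed the race, a retry wrapper like this sketch would paper over it (`rmtree_with_retry` is my own hypothetical helper, not TAO code):

```python
import errno
import shutil
import time

def rmtree_with_retry(path, attempts=5, delay=0.5):
    """Hypothetical helper: retry when another rank drops a file into
    the directory mid-removal (my assumption about the failure mode)."""
    for i in range(attempts):
        try:
            shutil.rmtree(path)
            return
        except OSError as e:
            if e.errno != errno.ENOTEMPTY or i == attempts - 1:
                raise
            time.sleep(delay)  # give the straggler time to finish writing
```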
Long story short: I don’t think this is a dataset issue. Have you been able to check internally? BTW, this is TAO 5.2; I have another question on that in a separate post.
Finally, further supporting that the dataset is OK: I still get good results even though the validation does not run as part of the training.
Thank you.
Cheers