UffParser: Validator error: block_4c_bn_3/cond/Switch: Unsupported operation _Switch

So, an update on this: yes, there was a drop in accuracy with freeze_bn: True.
It is a noticeable drop:

        AP:        60.59 → 56.6
        AP50:      86.3  → 79.4
        mask AP:   46.3  → 44.2
        mask AP50: 63.   → 59.6

This is not good. I think we should have a simple option to not freeze the BN layers.
Can you please look into it?

Changes in config:
freeze_bn: True
test_detections_per_image: 50 # was 100
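For reference, both keys sit under the maskrcnn_config block of the training spec (assuming the usual sample-spec layout); the relevant part looks roughly like this, with other fields omitted:

    maskrcnn_config {
        ...
        freeze_bn: True
        ...
        test_detections_per_image: 50   # 100 in the baseline run
        ...
    }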

Do you think I should re-run with test_detections_per_image: 100? Will it make any difference, based on your experience?

Yes, please run with test_detections_per_image: 100 for an apples-to-apples comparison.

For freeze_bn, we're still syncing internally.

I actually did. The loss in accuracy is about 1.5% on each front (AP and mask AP). That is better than with test_detections_per_image: 50, but still not as good as the original.

Thank you. Please update me if there is any change, or if you need me to run any experiments on my end.

Hi,
I went through the topic again. Could you train a ResNet-18 model on a very small dataset on the RTX 3090 to check whether it succeeds? Please set freeze_bn: False.

Already did that. Failed. Same error.

Thanks for the info. I found that the code forces freeze_bn to True during exporting.
Will update you if there is anything more.
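In other words, the export path effectively does something like the sketch below before rebuilding the graph. This is a hypothetical illustration only, not the actual TAO source; Spec and prepare_spec_for_export are placeholder names.

    # Hypothetical sketch of the behaviour described above -- not the real TAO code.
    class Spec:
        freeze_bn = False              # whatever the user set at training time

    def prepare_spec_for_export(spec):
        spec.freeze_bn = True          # forced to True during export
        return spec

    print(prepare_spec_for_export(Spec()).freeze_bn)   # -> True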

Is it due to a TensorRT layer support issue, or is it just a small bug that can be fixed? With this setting, training from scratch could be an issue for anyone.

Hey, I faced an issue while converting.
I exported the .etlt and the engine successfully. Now, when I run the tao converter command, it fills all 64 GB of available RAM and crashes. Why is the conversion consuming such a huge amount of memory?

This is the log:

[INFO] [MemUsageChange] Init CUDA: CPU +536, GPU +0, now: CPU 542, GPU 659 (MiB)
[INFO] [MemUsageSnapshot] Builder begin: CPU 829 MiB, GPU 659 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +749, GPU +318, now: CPU 1641, GPU 977 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +616, GPU +268, now: CPU 2257, GPU 1245 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
2021-12-16 13:06:02,842 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

64 GB to convert doesn't add up; can you shed some light on it? Is it loading the entire dataset into memory? Doesn't that defeat the purpose of having batch sizes, since the dataset can be too big?

May I know the full log and full command?

Also, for freeze_bn=False, there is an issue in the exporting. It has been fixed and will be available in the next release.

An update on this: it feels like another bug.
I ran this command →

tao converter -k nvidia_tlt -d 3,832,1344 -o generate_detections,mask_fcn_logits/BiasAdd -e /workspace/tao-experiments/mask_rcnn/experiments/experiment_dir_retrained_p64/export/trt.fp16.engine -m 1 -t fp16 -i nchw /workspace/tao-experiments/mask_rcnn/experiments/experiment_dir_retrained_p64/model.step-60000.etlt

and got this log in stdout →

2021-12-16 17:54:01,507 [INFO] root: Registry: ['nvcr.io']
2021-12-16 17:54:01,544 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
2021-12-16 17:54:01,550 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/koireader/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
[INFO] [MemUsageChange] Init CUDA: CPU +535, GPU +0, now: CPU 541, GPU 644 (MiB)
[INFO] [MemUsageSnapshot] Builder begin: CPU 829 MiB, GPU 644 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +749, GPU +318, now: CPU 1641, GPU 962 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +616, GPU +268, now: CPU 2257, GPU 1230 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[INFO] Detected 1 inputs and 2 output network tensors.
[INFO] Total Host Persistent Memory: 253168
[INFO] Total Device Persistent Memory: 77880832
[INFO] Total Scratch Memory: 53721600
[INFO] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 148 MiB, GPU 32 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3276, GPU 1720 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 3276, GPU 1728 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 3276, GPU 1716 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 3276, GPU 1700 (MiB)
[INFO] [MemUsageSnapshot] Builder end: CPU 3213 MiB, GPU 1700 MiB
2021-12-16 18:00:05,210 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

I ran it on 3 different systems, and here is the surprising result:

  • RTX 3090 with 64 GB RAM – it filled the RAM and swap and crashed
  • RTX 3090 with 128 GB RAM – it used a peak of about 90 GB of RAM and completed
  • Local machine with 8 GB RAM and a GTX 1050 Ti mobile – converted successfully with barely 4 GB of RAM used

Why would that happen? It feels like another bug, maybe related to the GPU architecture?

The above log is not abnormal. The TensorRT engine has been generated successfully.
You can check it.

Yes, I know that. I am referring to this:

It crashed in the first scenario. It took 90 GB of RAM to produce the output; that's not normal, right?
The same model on the GTX 1050 Ti took 4 GB?

Please add the “-w” option when OOM happens.
For example,
$ ./tao-converter -k nvidia_tlt -d 3,576,960 -o generate_detections,mask_fcn_logits/BiasAdd -t int8 -c peoplesegnet_resnet50_int8.txt -m 1 -w 100000000 peoplesegnet_resnet50.etlt
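With your command from above, that would look like this, for example (the -w value is the maximum workspace size in bytes; tune it to what your system can spare):

$ tao converter -k nvidia_tlt -d 3,832,1344 -o generate_detections,mask_fcn_logits/BiasAdd -e /workspace/tao-experiments/mask_rcnn/experiments/experiment_dir_retrained_p64/export/trt.fp16.engine -m 1 -t fp16 -i nchw -w 100000000 /workspace/tao-experiments/mask_rcnn/experiments/experiment_dir_retrained_p64/model.step-60000.etlt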

Similar topic: Tao-converter mask_rcnn int8 engine creation fails - #2


For freeze_bn=False, if you want to have the fix as soon as possible, please let us know.

@alaapdhall79
For the export error when a model is trained with “freeze_bn=False”, the workaround is as below.

Please modify /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/layers/normalization.py inside the docker

from (line 693):

    def call(self, inputs, training=None):

to:

    def call(self, inputs, training=False):

Then, run “mask_rcnn export xxx” inside the docker.
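For context on why this works: with training=False, the Keras BatchNormalization layers always run in inference mode, i.e. they normalize with their stored moving mean/variance instead of the current batch statistics, which is what frozen BN means. A minimal standalone illustration (plain TF Keras, not TAO code):

    import numpy as np
    import tensorflow as tf

    # Evaluate eagerly so the calls below run immediately
    # (eager is already the default on TF 2.x; this is needed on TF 1.15).
    tf.compat.v1.enable_eager_execution()

    bn = tf.keras.layers.BatchNormalization()
    x = tf.constant(np.random.rand(4, 8).astype(np.float32))

    y_batch = bn(x, training=True)     # normalizes with the current batch's mean/variance
    y_frozen = bn(x, training=False)   # normalizes with the stored moving statistics ("frozen" BN)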


Oh, sure. Sorry for the late reply. I'll try this with the model and get back soon. I shifted to training LPRNet, hence I didn't update here.
I am also facing some issues with that model; here is the link to the new issue I created:
